ElasticNet UME R18 Alarm Handling
Version: V16.19.40
ZTE CORPORATION
No. 55, Hi-tech Road South, Shenzhen, P.R. China
Postcode: 518057
Tel: +86-755-26771900
URL: http://support.zte.com.cn
E-mail: support@zte.com.cn
LEGAL INFORMATION
Copyright 2020 ZTE CORPORATION.
The contents of this document are protected by copyright laws and international treaties. Any reproduction
or distribution of this document or any portion of this document, in any form by any means, without the
prior written consent of ZTE CORPORATION is prohibited. Additionally, the contents of this document
are protected by contractual confidentiality obligations.
All company, brand and product names are trade or service marks, or registered trade or service marks,
of ZTE CORPORATION or of their respective owners.
This document is provided "as is", and all express, implied, or statutory warranties, representations, or
conditions are disclaimed, including without limitation any implied warranty of merchantability, fitness for
a particular purpose, title or non-infringement. ZTE CORPORATION and its licensors shall not be liable
for damages resulting from the use of or reliance on the information contained herein.
ZTE CORPORATION or its licensors may have current or pending intellectual property rights or
applications covering the subject matter of this document. Except as expressly provided in any written
license between ZTE CORPORATION and its licensee, the user of this document shall not acquire any
license to the subject matter herein.
ZTE CORPORATION reserves the right to upgrade or make technical changes to this product without
further notice.
Users may visit the ZTE technical support website http://support.zte.com.cn to inquire about related
information.
Revision History
2.19 0050 Business performance threshold exceeded.............................................2-18
2.20 0052 Middleware performance threshold exceeded......................................... 2-19
2.21 0053 Framework performance threshold exceeded......................................... 2-19
2.22 0054 self-maintain performance threshold exceeded....................................... 2-20
2.23 1513 Performance threshold alarm...................................................................2-20
2.24 3001 Abnormal Minion Status...........................................................................2-21
2.25 3002 Cluster status abnormal...........................................................................2-22
2.26 4002 Insufficient Tenant Quota......................................................................... 2-23
2.27 5001 Certificate Will Expire Soon..................................................................... 2-24
2.28 5002 Certificate Expired....................................................................................2-24
2.29 5101 Create Project Failed............................................................................... 2-24
2.30 5102 Delete Project Failed............................................................................... 2-25
2.31 7001 Storage cluster status abnormal.............................................................. 2-26
2.32 7002 Cluster Capacity Usage Exceeded the Threshold................................... 2-27
2.33 7003 Volume Capacity Usage Exceeded the Threshold...................................2-27
2.34 8501 NBM Initialization Failed.......................................................................... 2-28
2.35 9101 PostgreSQL database cluster unavailable...............................................2-28
2.36 9102 PostgreSQL database cluster contains unavailable nodes...................... 2-29
2.37 9103 PostgreSQL database cluster replication interrupts or produces brain-split....... 2-30
2.38 9104 PostgreSQL database master and standby cluster replication interruption.........2-31
2.39 9105 PostgreSQL database failed to archive log file........................................2-32
2.40 9141 Index is damaged in common service Elasticsearch............................... 2-33
3 OMC Alarm.................................................................................................3-1
3.1 1000 User locked.................................................................................................. 3-2
3.2 1001 Hard disk usage of database server overload............................................. 3-2
3.3 1002 CPU usage of application server overload.................................................. 3-3
3.4 1003 RAM usage of application server overload..................................................3-3
3.5 1004 Application server disk-overload.................................................................. 3-3
3.6 1008 Database instance space usage too large.................................................. 3-4
3.7 1012 License is expired........................................................................................3-4
3.8 1013 License is about to expire........................................................................... 3-5
3.9 1015 The link between the server and the ME agent is broken........................... 3-5
3.10 1017 The time in which the designated alarm remains active has expired......... 3-6
3.11 1018 The time in which the designated alarm remains unacknowledged has expired.... 3-6
3.12 1022 Merge rule root alarm................................................................................ 3-7
3.13 1023 Suppress plan task.................................................................................... 3-7
3.14 1025 Automatic backup failure........................................................................... 3-8
3.15 1028 Alarm forwarding failure.............................................................................3-8
3.16 1034 License consumption exceeds the alarm threshold................................... 3-9
3.17 1035 License consumption exceeds the total authorization............................... 3-9
3.18 1050 Wrong login password............................................................................... 3-9
3.19 1060 The number of users assigned the specific type exceeds the limit.......... 3-10
3.20 1061 The number of users assigned the specific type is about to exceed the limit...... 3-10
3.21 1300 Password has expired............................................................................. 3-11
3.22 1301 Password will expire................................................................................ 3-11
3.23 1310 The number of login users exceeds the limit...........................................3-11
3.24 1311 SNMP authentication failure.....................................................................3-12
4 Communication Alarm.............................................................................. 4-1
4.1 1014 The link between the server and the ME is broken..................................... 4-1
4.2 1040 ME or agent backend start failure............................................................... 4-1
4.3 200204012 S1 link is broken................................................................................ 4-2
4.4 200204013 Power supply failure.......................................................................... 4-2
4.5 200204014 Transport failure................................................................................. 4-3
5 Processing Error Alarm............................................................................5-1
5.1 0502 K8s schedule failed..................................................................................... 5-1
5.2 0503 K8s create pod failed...................................................................................5-3
5.3 0504 Failed to Delete a Pod................................................................................ 5-3
5.4 1014 Abnormal Service Operational Status..........................................................5-4
5.5 1015 Abnormal Microservice Operational Status................................................ 5-5
5.6 2001 Add network for Pod error........................................................................... 5-6
5.7 2002 IaaS account authentication failed...............................................................5-7
5.8 8001 Commonservice deployed failed..................................................................5-7
5.9 9302 Failed to synchronize data to slave zone.................................................... 5-8
6 Environment Alarm................................................................................... 6-1
6.1 9121 FTP disk space is insufficient...................................................................... 6-1
6.2 9122 FTP disk read and write exception..............................................................6-2
6.3 9201 Common Service Kafka node is offline....................................................... 6-2
6.4 9301 The connection for geographical disaster recovery is broken......................6-3
7 Integrity Violation Alarm...........................................................................7-1
7.1 15010001 Alarm for Missing of PM Data............................................................. 7-1
7.2 15010002 Alarm for Missing of NAF PM Data..................................................... 7-1
Glossary............................................................................................................. I
About This Manual
Purpose
The ElasticNet UME R18 (hereinafter referred to as the UME) is a RAN element
management system.
This manual provides a reference for alarms related to the UME system. For alarms
related to a specific NE, refer to the corresponding user manual of the NE.
Intended Audience
Chapter 1, Equipment Alarm: Provides a reference for equipment alarms related to the UME system.
Chapter 2, QoS Alarm: Provides a reference for QoS alarms related to the UME system.
Chapter 3, OMC Alarm: Provides a reference for network management alarms related to the UME system.
Chapter 4, Communication Alarm: Provides a reference for communication alarms related to the UME system.
Chapter 5, Processing Error Alarm: Provides a reference for processing error alarms related to the UME system.
Chapter 6, Environment Alarm: Provides a reference for environment alarms related to the UME system.
Chapter 7, Integrity Violation Alarm: Provides a reference for integrity violation alarms related to the UME system.
Related Documentation
ElasticNet UME R18 Unified Management Expert System Alarm Management Operation
Guide
Conventions
Chapter 1
Equipment Alarm
Table of Contents
0001 Container Startup Failed.......................................................................................1-1
0011 Node Heartbeat lost..............................................................................................1-2
0051 Hardware performance threshold exceeded........................................................ 1-3
0069 Hardware Alarm....................................................................................................1-4
3003 Component Instance State Exception.................................................................. 1-4
7004 NFS Shared Volume Multi-Mounted In The Cluster............................................. 1-5
7005 Read-Only NFS Shared Volume.......................................................................... 1-5
9031 One Port of the Bond Group Fault....................................................................... 1-5
9032 All Ports of the Bond Group Fault........................................................................ 1-6
9033 OVS Service Fault................................................................................................ 1-6
9321 Platform pg database is unusable........................................................................1-7
9322 Platform pg node instance is abnormal..............................................1-7
9323 pacemaker cluster heartbeat is abnormal............................................................ 1-8
Alarm Cause
Action
1. On the Application Manager page, click an application to enter the details page.
Click the Alarm tab. View the current alarms and check whether there is the “Pod
network configuration failure” alarm.
a. Yes -> Handle the fault based on related handling suggestions.
b. No -> Step 2.
2. If this application is not deployed through a blueprint, go to Step 5.
3. If this application is deployed through a blueprint, check whether the blueprint
container image is correct.
4. Select Software Repository > Blueprint , and click a blueprint used by the
application. Click the used blueprint version. In the Action column, select Edit to
enter the blueprint editing page. Check whether the Pod container image name and
version number exist in the image repository.
a. No -> Modify the Pod container image and save it. In the Action column, click
Deploy to re-deploy the application, and delete the original application.
b. Yes -> Step 5.
5. Check whether the application itself is abnormal.
6. On the Application Manager page, click Application Name to enter the microservice
page. Click Microservice Name to enter the Pod page. Click Pod Name to enter the
container page. Click the Container tab. Click Container Name to enter the container
details page. Select the Log tab. Check whether the application is abnormal in
accordance with the container logs. If there are no logs or you cannot determine the
cause, contact ZTE technical support.
Alarm Cause
Action
1. Log in to the PaaS Controller node, and execute ssh ubuntu@IP address of the
control node.
a. Successful login -> Step 2.
b. Login failure -> Step 4.
2. Log in to the abnormal node. Assume that the node name is default-np-5-192.173.0.57.
a. You can log in to the node by executing ssh ubuntu@192.173.0.57.
b. Successful login -> Step 3.
c. Login failure -> Step 4.
3. Restart the heartbeat handshaking component.
a. Switch to the root user: sudo su.
b. Execute the service heartbeat restart command. Wait for 5 minutes, and check
whether the alarm is cleared.
c. Yes -> End.
d. No -> Step 4.
4. Restart the abnormal node.
a. In a non-preset scenario, select Resources > Compute > Nodes . In the search
box, enter '192.173.0.57' as the alarm object name to find the abnormal node.
Click the restart button of the node.
b. In a preset scenario, switch to the root user (sudo su), and execute reboot to
restart the abnormal node.
5. Check whether the alarm is cleared.
a. Yes -> End.
b. No -> Contact ZTE technical support.
Alarm Cause
The hardware performance indicator exceeds the threshold within the specified
inspection period.
Action
Alarm Cause
Action
Check the running status of the hardware equipment according to the alarm location
information. If there are no logs or the cause cannot be determined, contact ZTE
technical support.
Alarm Cause
Action
If the alarm is not cleared within 10 minutes, contact ZTE technical support.
Alarm Cause
Action
Alarm Cause
Action
Alarm Cause
Action
Check the specific causes of the 9031 alarm, and then have maintenance personnel
repair the faulty Ethernet interface.
Alarm Cause
Action
Check the specific causes of the 9032 alarm, and then have maintenance personnel
repair the faulty Ethernet interface.
Alarm Cause
Action
Alarm Cause
Action
If the alarm is not cleared within 10 minutes, contact ZTE technical support to repair
the pg database failure.
Alarm Cause
Action
If the alarm is not cleared within 20 minutes, contact ZTE technical support to repair
the pg database failure.
Alarm Cause
Action
If the alarm is not cleared within 20 minutes, contact ZTE technical support to repair
the pacemaker cluster node failure.
Alarm Cause
Action
2. Check whether the CPU usage trend of the node is consistent with the service status
based on the analysis of the on-site services.
Yes -> Step 3.
No -> Contact ZTE technical support.
3. Ask the project administrator of the service to check whether the service status is
normal.
Yes -> Step 4.
No -> Solve the problems with the service.
4. Determine whether to increase the CPU usage threshold in accordance with the
service status.
If the traffic increases sharply in a short time, modification is not recommended.
When the traffic decreases to normal level, the alarm is cleared automatically.
If the traffic increases over a long period, it is suggested to adjust the CPU usage
QoS threshold.
5. Select Settings > Alarm management . Click the QoS Manage tab and unfold the
node. Click the Modify button in the CPU Usage Rate line, and then modify the
thresholds at all levels.
Alarm Cause
Action
a. Log in to the Portal admin web UI, select Monitor > Alarm > Current Alarm , and
click the alarm name to enter the detail page.
b. Click Go to check to enter the node information page.
c. Click History Performance , and select a time tab to check the memory usage.
2. Based on the analysis of the on-site business, confirm whether the Memory usage is
consistent with the business status.
Yes → Go to step 3.
No → Contact ZTE technical support.
3. Confirm whether the service status is normal.
Yes → Go to step 4.
No → Solve the problems with the service.
4. Determine whether to increase the memory usage threshold in accordance with the
service status.
If the traffic increases sharply in a short time, modification is not recommended.
When the traffic decreases to normal level, the alarm is cleared automatically.
If the traffic increases over a long period, it is suggested to adjust the memory usage
QoS threshold and go to Step 5.
5. Select Settings > Alarm management . Click the QoS Manage tab and unfold the
node. Click the Modify button in the Memory Usage Rate line, and then modify the
thresholds at all levels.
Alarm Cause
Action
Alarm Cause
Action
Alarm Cause
Action
Alarm Cause
Action
Alarm Cause
Action
Alarm Cause
Action
Alarm Cause
The disk partition usage of the node exceeds the QoS threshold.
Action
4. Determine whether to increase the disk partition usage threshold in accordance with
the service status.
If the traffic increases sharply in a short time, modification is not recommended.
When the traffic decreases to normal level, the alarm is cleared automatically.
If the traffic increases over a long period, it is suggested to adjust the disk usage QoS
threshold and go to Step 5.
5. Select Settings > Alarm management . Click the QoS Manage tab and unfold the
disk partition. Click the Modify button in the Disk Partition Usage Rate line, and then
modify the thresholds at all levels.
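Before deciding in step 4 whether to raise the threshold, it helps to confirm which partition is actually over the limit. The following is a minimal sketch, assuming `df -P` output columns and an illustrative 80% threshold; the real limit is whatever is configured under QoS Manage:

```shell
# Sample `df -P` style output; on a real node, pipe `df -P` in directly.
dfout='Filesystem 1024-blocks Used Available Capacity Mounted-on
/dev/sda1 102400 92160 10240 90% /
/dev/sda2 102400 10240 92160 10% /data'

# Print every mount point whose usage is at or above the illustrative 80% threshold.
over=$(printf '%s\n' "$dfout" | awk 'NR > 1 { gsub("%", "", $5); if ($5 + 0 >= 80) print $6 }')
echo "$over"
```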
2.10 0030 The time offset from the NTP server is too large
Alarm Information
Alarm Cause
Action
b. Start the NTP service of the node with the command “systemctl start
ntpd.service”.
c. Use the command “systemctl status ntpd.service” to check whether the NTP
service of the node is successfully started (the status is active).
3. If the NTP server is normal, the NTP service of the node may be abnormal. Please
contact technical support.
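The offset check behind this alarm can be sketched as a small test on a sampled offset value. The 128 ms limit below is an assumption for illustration only; the product compares against its own configured threshold:

```shell
# Return success (exit 0) when the absolute offset, in milliseconds, exceeds the limit.
offset_too_large() {
    awk -v o="$1" 'BEGIN { if (o < 0) o = -o; exit (o >= 128) ? 0 : 1 }'
}

if offset_too_large 512.0; then
    echo "offset too large: restart the NTP service (systemctl restart ntpd.service)"
else
    echo "offset within limit"
fi
```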
Alarm Cause
The CPU Iowait usage of the node exceeds the QoS threshold.
Action
3. Ask the project administrator of the service to check whether the service status is
normal.
Yes -> Step 4.
No -> Solve the problems with the service.
4. Determine whether to increase the CPU Iowait Usage threshold in accordance with
the service status.
If the traffic increases sharply in a short time, modification is not recommended.
When the traffic decreases to normal level, the alarm is cleared automatically.
If the traffic increases over a long period, it is suggested to adjust the CPU Iowait
usage QoS threshold and go to Step 5.
5. Select Settings > Alarm management . Click the QoS Manage tab and unfold the
node. Click the Modify button in the CPU Iowait Usage Rate line, and then modify the
thresholds at all levels.
Alarm Cause
The CPU usage of the Steal process of the node exceeds the QoS threshold.
Action
c. Click the History Performance tab. Select a time period and view the trend graph
of CPU Steal Usage Rate.
2. Check whether the CPU usage trend of the Steal process is consistent with the
service status based on the analysis of the on-site services.
Yes -> Step 3.
No -> Contact ZTE technical support.
3. Ask the project administrator of the service to check whether the service status is
normal.
Yes -> Step 4.
No -> Solve the problems with the service.
4. Determine whether to increase the CPU Steal Usage threshold in accordance with
the service status.
If the traffic increases sharply in a short time, modification is not recommended.
When the traffic decreases to normal level, the alarm is cleared automatically.
If the traffic increases over a long period, it is suggested to adjust the CPU Steal
usage QoS threshold and go to Step 5.
5. Select Settings > Alarm management . Click the QoS Manage tab and unfold the
node. Click the Modify button in the CPU Steal Usage Rate line, and then modify the
thresholds at all levels.
Alarm Cause
The system PID usage of the node exceeds the QoS threshold.
Action
Alarm Cause
Action
2.15 0036 Node can not synchronize time with NTP server
Alarm Information
Alarm Cause
Action
1. On the platform O&M portal, select Monitor > Alarm > Current Alarm . If it displays
“NTP daemon exit” or “NTP offset high”, refer to the alarm handling suggestions.
Otherwise, go to Step 2.
2. Check whether the NTP service of this node is normal.
a. Log in to the alarm node through SSH and switch to the root user.
b. Check whether the NTP service is running normally:
systemctl status ntpd.service
Check if the service is active. If it is not active, please contact the administrator.
3. Check whether the network between the node and the NTP server is connected.
a. Log in to the alarm node through SSH and switch to the root user.
b. Use the ping command to check whether the network between the node and the
NTP server is connected. If there are multiple servers, ping them one by one. If
the ping fails, solve the network problem first.
cat /etc/ntp.conf |grep "^server" |grep -v "127.127.1.0"
server 10.30.1.105 minpoll 3 maxpoll 4
#In the example, 10.30.1.105 is the NTP server.
ping 10.30.1.105
4. Contact the administrator to confirm that the NTP server is normal. If it is not normal,
first solve the problem of the NTP server, and then observe whether the alarm is
restored.
5. Contact ZTE technical support.
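The grep filter shown in step 3 can be wrapped into a short sketch that lists the servers to ping, excluding the local clock entry 127.127.1.0 as the manual does. The configuration content below is a sample, not read from a real node:

```shell
# Sample /etc/ntp.conf content; on the node you would read the real file instead.
conf='server 10.30.1.105 minpoll 3 maxpoll 4
server 127.127.1.0'

# Keep only real upstream servers, then print the address column.
servers=$(printf '%s\n' "$conf" | grep '^server' | grep -v '127.127.1.0' | awk '{print $2}')
echo "$servers"    # ping each address listed here one by one
```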
The network rx rate of the node is too high, and alarms of different levels are raised
dynamically:
→ When the rate reaches 300000000Bps, a warning alarm is raised.
→ When the rate reaches 500000000Bps, a minor alarm is raised.
→ When the rate reaches 750000000Bps, a major alarm is raised.
→ When the rate reaches 900000000Bps, a critical alarm is raised.
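The dynamic level mapping above can be expressed as a small helper function. The thresholds are copied from the list; the function itself is only an illustration, not part of the product:

```shell
# Map a network rx rate in Bps to the alarm severity raised at that rate.
rate_to_level() {
    if   [ "$1" -ge 900000000 ]; then echo critical
    elif [ "$1" -ge 750000000 ]; then echo major
    elif [ "$1" -ge 500000000 ]; then echo minor
    elif [ "$1" -ge 300000000 ]; then echo warning
    else echo none
    fi
}

rate_to_level 820000000    # in the 750000000-900000000 band -> major
```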
Alarm level: Undefined
Alarm type: QoS Alarm
Alarm Cause
Action
Alarm Cause
Action
If the traffic increases over a long period, it is suggested to adjust the QoS threshold
of the network tx rate and go to Step 5.
5. Select Settings > Alarm management . Click the QoS Manage tab and unfold the
node. Click the Modify button in the Network Tx Rate line, and then modify the
thresholds at all levels.
Alarm Cause
The time difference between the node and the NTP server is too large.
The NTP service is abnormal.
Action
1. Check the time difference between the node and the clock source to determine
whether it is because the time difference is too large. For the check method and
processing method, refer to the handling suggestions of the alarm “The time offset
from the NTP server is too large”.
2. If the time difference is too large, contact the administrator. If it is for other reasons,
contact technical support.
Alarm Cause
The business performance indicator exceeds the threshold within the specified
inspection period.
Action
Alarm Cause
The middleware performance indicator exceeds the threshold within the specified
inspection period.
Action
Alarm Cause
The framework performance indicator exceeds the threshold within the specified
inspection period.
Action
Alarm Cause
The self-maintain performance indicator exceeds the threshold within the specified
inspection period.
Action
Alarm Cause
None.
Action
Alarm Cause
Action
b. After the alarm 2001 is cleared, check whether this alarm is cleared. Yes ->
End. No -> go to Step 3.
3. Contact the administrator.
If the alarm is “Minion is absent”:
1. Delete the node from the cluster.
a. The value of the object ID in the alarm information is the node’s uuid.
b. View the uuid of the home cluster in the “Extra Params” in the “Detail
Information” box.
c. Delete the node from the cluster by using the command.
Log in to the control node, and switch to the root user. Enter the command to
delete the node.
sudo su
cluster delete <cluster_uuid> node <node_uuid>
d. After the node is deleted, check whether the alarm is cleared. No -> go to
Step 2.
2. Contact the administrator.
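Because `cluster delete` is destructive, it is worth sanity-checking the uuids copied from the alarm before running it. The sketch below assumes the usual lowercase RFC 4122 textual uuid form; the pattern and example uuid are illustrative only:

```shell
# Return success when the argument looks like a lowercase RFC 4122 uuid.
is_uuid() {
    printf '%s' "$1" | grep -Eq '^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$'
}

if is_uuid "123e4567-e89b-12d3-a456-426614174000"; then
    echo "uuid looks valid; safe to build: cluster delete <cluster_uuid> node <node_uuid>"
else
    echo "uuid malformed: re-check the Object ID in the alarm information"
fi
```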
Alarm Cause
More than half of the control nodes in the cluster cannot provide services.
There is no available working node in the cluster.
Action
1. On the platform O&M portal, select Environment > Business Cluster . The Detail page
is displayed. Click the Node tab and view the nodes and roles under the cluster.
If the minion node does not exist in the cluster, go to Step 2.
If the minion node exists in the cluster, go to Step 3.
2. Expand the capacity of the cluster and add the minion node.
a. Click the Scale out button on the page described in Step 1, and add the minion
node.
b. Wait for the minion node to be deployed, and check whether this alarm is cleared.
Yes -> End.
3. On the platform O&M portal, select Monitor > Alarm > Current Alarm . Check whether
the “Abnormal cluster node status” alarm exists.
Yes -> go to Step 4.
No ->Contact ZTE technical support.
4. View the additional information of the “Abnormal cluster node status” alarm and
check whether the home cluster of the node is the cluster that raises the alarm.
Yes -> Step 5.
No -> Contact ZTE technical support.
5. For each node in the cluster that raises the “Abnormal cluster node status” alarm,
follow the alarm handling suggestions. Wait until the “Abnormal cluster node status”
alarm is cleared, and check whether this alarm is cleared.
Yes -> End.
No -> Contact ZTE technical support.
Alarm Cause
The remaining disk quota of the tenant is less than 10% of the total quota.
Action
Alarm code: 5001
Alarm description: The user’s certificate file will expire soon.
Alarm level: Major
Alarm type: QoS Alarm
Alarm Cause
Action
Select Settings > Cert Manager , and check the certificate files according to the alarm
information. Then click the Update button, and update the certificate according to the
page tips.
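Certificates are updated through the portal, but the remaining validity of a certificate file can also be checked with openssl. The sketch below generates a throwaway self-signed certificate purely for the demonstration; the 30-day warning window is illustrative, not the product's actual threshold:

```shell
# Create a scratch directory and a 60-day self-signed certificate for the demo.
dir=$(mktemp -d)
openssl req -x509 -newkey rsa:2048 -keyout "$dir/key.pem" -out "$dir/cert.pem" \
    -days 60 -nodes -subj "/CN=demo" 2>/dev/null

# -checkend succeeds only if the certificate is still valid after the given seconds.
if openssl x509 -checkend $((30 * 24 * 3600)) -noout -in "$dir/cert.pem" >/dev/null; then
    result="certificate valid for at least 30 more days"
else
    result="certificate expires within 30 days: update it via Settings > Cert Manager"
fi
echo "$result"
```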
Information,and then click Update button,Update the certificate acoording to page tips.
Alarm Cause
The certificate has expired.
Action
1. Select Projects & Users > Projects , check the cause of the project creation failure,
and click Retry to re-create the project.
2. If you cannot resolve the problem, contact ZTE technical support.
Alarm code: 5101
Alarm description: Creating a project failed.
Alarm level: Critical
Alarm Cause
Action
1. Select Projects & Users > Projects , check the cause of the project creation failure,
and click Retry to re-create the project.
2. If you cannot resolve the problem, contact ZTE technical support.
Alarm Cause
Action
1. On the platform O&M portal, select Resources > Storage . Click Built-in storage
cluster , and check whether the Status column is healthy .
Yes -> Contact ZTE technical support.
No -> Step 2.
2. Wait five minutes and check whether the storage cluster heals itself. After five
minutes, check whether the Status column is healthy .
Yes -> End.
No -> Step 3.
3. If the Status column of a cluster is unhealthy , click the cluster name to enter the
storage cluster details page. Observe the Node information list.
If the Status column of each node is normal, contact ZTE technical support.
Alarm Cause
Action
1. On the platform O&M portal, select Resources > Storage . Click Built-in storage
cluster , and check whether the Status column is healthy .
Yes -> Contact ZTE technical support.
No -> Step 2.
2. Wait five minutes and check whether the storage cluster heals itself. After five
minutes, check whether the Status column is healthy .
Yes -> End.
No -> Step 3.
3. If the Status column of a cluster is unhealthy , click the cluster name to enter the
storage cluster details page. Observe the Node information list.
If the Status column of each node is normal, contact ZTE technical support.
If the Status column of a node is abnormal, go to Step 4.
4. If the Status column of a node is abnormal, click the Restart button in the Action
column of the node to manually complete the forced recovery of the storage cluster.
Wait five minutes and observe the Status column of the node. If it is normal and the
Status column of the corresponding storage cluster is healthy , the alarm handling is
completed. Otherwise, contact ZTE technical support.
Alarm Cause
The used capacity of the cluster exceeds 80% of the total capacity.
Action
Alarm Cause
The used capacity of the volume exceeds 80% of the total capacity.
Action
If not, ask the platform administrator to add a storage device to expand the storage
volume capacity.
Alarm Cause
Action
Contact ZTE technical support to check whether the RabbitMQ service is normal.
Alarm Cause
A node whose PG status is LATEST or SYNC fails to start for some reason. PGs in
other states can start normally but do not have the right to be promoted, so the whole
cluster cannot elect a master node.
Action
First, check the startup log of the PG whose status is LATEST or SYNC, or start the PG
manually through a PSQL client, to find the cause of the startup failure.
If it cannot start at all, the only option is to start a PG whose status is not LATEST or
SYNC. Such a PG may contain less data than the LATEST or SYNC one, so forcing it to
start may result in a small amount of data loss.
For other information, contact ZTE technical support.
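The decision described above can be summarized as a tiny status-to-action map. The status strings come from the alarm details; the action wording is only a condensed restatement of the text, not product output:

```shell
# Map the PG status of the failed node to the recommended recovery step.
pg_action() {
    case "$1" in
        LATEST|SYNC) echo "check the startup log, or start manually via a PSQL client" ;;
        *)           echo "force-start only as a last resort: a small amount of data may be lost" ;;
    esac
}

pg_action SYNC
pg_action ASYNC
```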
Alarm Cause
Action
1. Use the crm status command to check the status of the cluster. If the master exists
and the standby node starts normally but the stream replication is abnormal:
Through the Self-Management entry, select pg-mng to enter the PG Manager page,
click the problem node, and select pull the data to pull data from the master node
manually.
2. The master exists but the standby node failed to start.
a. First, check the log of the failed PG, or start the PG manually through the PSQL
client, to find the cause of the failure.
b. If the standby node cannot start at all, select the problem node and pull the data
manually through the Self-Management entry > pg-mng PG Manager page.
3. There is no master, and the PG with LATEST or SYNC status did not start successfully.
a. First, check the startup log of the PG whose status is LATEST or SYNC, or start
the PG manually with the PSQL client, to find the cause of the startup failure.
b. If it cannot start at all, restore the database if a backup is available. If restoring is
not an option, the only choice is to start a PG whose status is not LATEST or SYNC.
c. Such a PG may contain less data than the LATEST or SYNC one, so forcing it to
start may result in a small amount of data loss.
Contact ZTE technical support to check whether the service is normal.
Alarm Cause
The master node failed to be promoted, and there was no master node providing
external services.
Action
On the PaaS operation and maintenance interface, select Monitoring -> Alarm -> Details,
and check the specific cause of alarm 9103 in the Details item.
If the cause is shown as "Need People Repair, Streaming is break, You can use
portal to pull full data from Master", the stream replication is broken and the data
needs to be pulled manually through the management interface.
Select Self-Management Entry -> pg-mng, click Enter to open the PG Manager page,
and select the problem node to pull the data manually.
If the cause is shown as "It is possible to have a split brain, keep ban status", the
timeline of the original master node and that of the new master node is the same,
which may indicate a split brain; the data must be compared and merged manually.
If the cause is shown as "Maybe data loss or database abnormality, need Repair,
keep ban status", the difference between the master and standby node data is greater
than the preset value, and the data must be compared and merged manually.
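For the split-brain case, the timeline IDs mentioned above can be read directly from each node's control file; this is a minimal sketch, assuming shell access to a database node, with the data directory path as a placeholder.

```shell
# Sketch: read the checkpoint timeline on a node, to compare the original
# and new master. PG_DATA_DIR is a placeholder for the real data directory.
PG_DATA_DIR="${PG_DATA_DIR:-/var/lib/pgsql/data}"

if command -v pg_controldata >/dev/null 2>&1; then
    # Run this on both nodes; identical timeline IDs match the
    # "possible split brain" condition described above.
    pg_controldata "$PG_DATA_DIR" | grep -i "timeline"
else
    echo "pg_controldata not found; run this on a database node"
fi
```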
Alarm Cause
The stream replication between the disaster recovery cluster and the master cluster is
interrupted, and disaster recovery fails.
Action
Alarm Cause
Action
Error 117 ("full walname is not found"): configure the full WAL path and name.
Alarm code: 9141
Alarm description: When the index status is red in the common service Elasticsearch,
the index is damaged and this alarm is generated.
Alarm level: Undefined
Alarm type: QoS Alarm
Alarm Cause
Action
1. Check whether there are related alarms on the application management page.
a. On the PaaS operation and maintenance interface, select Monitoring -> Alarm ->
Details, and obtain the Object ID from the Details item.
b. On the Application Manager page of the opcs project, find the application named
"commsrves-<Object ID>". Open the Alarm page of the application, and
check whether there is any current alarm.
Yes -> Handle the fault based on related handling suggestions.
No -> Please contact ZTE technical support.
2. For other information, please contact ZTE technical support.
Alarm Cause
None.
Action
Check and analyze the login log to find out whether the problem is caused by a password-
guessing attack. If not, contact the system administrator to unlock the user account.
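As a minimal sketch of the log analysis, assuming a Linux host whose authentication log is in the usual sshd format (the log path and message pattern are assumptions; the platform's own audit log location may differ):

```shell
# Count failed login attempts per source address; a burst of failures from
# a single address suggests a password-guessing attack.
# AUTH_LOG and the "Failed password ... from <ip>" format are assumptions.
AUTH_LOG="${AUTH_LOG:-/var/log/auth.log}"

if [ -r "$AUTH_LOG" ]; then
    grep "Failed password" "$AUTH_LOG" \
        | awk '{for (i = 1; i <= NF; i++) if ($i == "from") print $(i + 1)}' \
        | sort | uniq -c | sort -rn | head
else
    echo "cannot read $AUTH_LOG; check the platform's audit log location"
fi
```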
Alarm Cause
The disk space occupied by audit logs exceeds the threshold "Lower Clean Percent".
The disk space occupied by program logs exceeds the threshold "Lower Clean
Percent".
Action
Alarm Cause
None.
Action
1. Check that the load of the UME is within the allowable range.
2. Check whether any unnecessary applications are running on the UME server. If yes,
exit those unnecessary applications.
Alarm Cause
None.
Action
1. Check that the load of the UME is within the allowed range.
2. Check whether any unnecessary applications are running on the UME server. If yes,
exit those applications to release some RAM.
3. Expand the RAM of the application server.
Alarm Cause
None.
Action
Alarm Cause
None.
Action
Alarm Cause
None.
Action
Alarm Cause
None.
Action
3.9 1015 The link between the server and the ME agent is
broken
Alarm Information
Alarm Cause
None.
Action
Check the link between the server and the agent as follows:
1. On the Alarm Monitor interface, view the details of the alarm to find the information
about the agent whose link is broken. Go to the proxy access UI and check the
proxy address information in the details of the corresponding proxy.
2. Verify whether the network is faulty. If the network is faulty, configure the
firewall accordingly.
3. Otherwise, restart the agent.
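The reachability part of the check can be sketched as follows; AGENT_IP and AGENT_PORT are placeholders to be taken from the proxy details page, and availability of the nc tool on the server is an assumption.

```shell
# Probe the agent from the server side. AGENT_IP/AGENT_PORT are
# placeholders taken from the proxy details page.
AGENT_IP="${AGENT_IP:-192.0.2.10}"
AGENT_PORT="${AGENT_PORT:-8080}"

if command -v ping >/dev/null 2>&1; then
    ping -c 2 -W 2 "$AGENT_IP" >/dev/null 2>&1 \
        && echo "host reachable" \
        || echo "no ICMP reply: host down, or ICMP blocked by a firewall"
fi
if command -v nc >/dev/null 2>&1; then
    nc -z -w 3 "$AGENT_IP" "$AGENT_PORT" 2>/dev/null \
        && echo "agent port open" \
        || echo "agent port closed/filtered: check the firewall, then restart the agent"
fi
```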
Alarm Cause
None.
Action
Alarm Cause
None.
Action
Alarm Cause
None.
Action
1. This alarm is displayed in the current alarm table to show all merged alarms.
2. Find the handling suggestion of each merged alarm by its alarm code, and then
handle the corresponding alarm according to the suggestion.
Alarm Cause
None.
Action
and switchover, users should clear this alarm. If the equipment alarms suppressed by
this alarm are already cleared, they do not need to be handled again. If some equipment
alarms are not cleared yet, users need to check and handle these equipment alarms.
Alarm Cause
None.
Action
Alarm Cause
None.
Action
Alarm Cause
None.
Action
Alarm Cause
None.
Action
Alarm Cause
None.
Action
Alarm Cause
None.
Action
Alarm Cause
None.
Action
Alarm Cause
None.
Action
Alarm Cause
None.
Action
Alarm Cause
None.
Action
The number of login users exceeds the threshold. Please reduce the number of login
users to avoid system abnormalities.
Alarm Cause
None.
Action
4.1 1014 The link between the server and the ME is broken
Alarm Information
Alarm Cause
None.
Action
Do the following to check whether the link between the server and the ME is normal:
1. Find the IP address of the ME on the Topo Management page.
2. Ping the IP address of the ME from the server.
3. If the ping in step 2 fails, solve the network or ME problems.
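The ping check can be scripted so the result is unambiguous; ME_IP is a placeholder for the address found on the Topo Management page.

```shell
# Step 2: ping the ME from the server. ME_IP is a placeholder for the
# address found on the Topo Management page.
ME_IP="${ME_IP:-192.0.2.20}"

if command -v ping >/dev/null 2>&1; then
    if ping -c 3 -W 2 "$ME_IP" >/dev/null 2>&1; then
        echo "ME reachable: the link problem is above the IP layer"
    else
        echo "ME unreachable: check the network path or the ME itself (step 3)"
    fi
else
    echo "ping not available on this server"
fi
```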
Alarm Cause
None.
Action
Alarm Cause
S1 link is broken
Action
Based on the network planning, check whether the settings of the IP address and the
route of each node (such as the BSC, RNC, and switching devices) over the transport
path are correct.
Alarm Cause
Action
Check whether the power supply equipment in the equipment room is normal.
Alarm Cause
Transport failure
Action
Alarm Cause
Action
1. Log in to the URL "http://[ip address]/portaladmin" and open the "Monitor" ->
"Alarm" -> "Current Alarm" tab.
If there is a "K8s Report Node Has Insufficient Memory" or "K8s Report Node Has
Disk Pressure" alarm, handle the alarm according to its suggestion.
If neither exists, go to step 2.
2. Open "Resource" -> "Compute" -> "Nodes", click the Kubernetes node name to
enter the detail page, and open the "Resources Monitor" page to check whether the
CPU/memory requested by the Pod can be satisfied.
Yes → Go to step 6.
No → Go to step 3.
3. Verify whether the CPU/memory resources requested by the application are adjustable.
Yes → Go to step 4.
No → Go to step 5.
4. Adjust the amount of CPU/memory resources requested by the application and
redeploy the application. Log in to the portal page, open "AppManager", select the
application name, and click "Delete".
Open the "Software Repository" -> "Image" page, find the corresponding image, click
the "Deploy" button, and redeploy with the adjusted CPU and memory. Alternatively,
open the "Software Repository" -> "Blueprint" page, click the corresponding blueprint
name, click "Edit" -> "Container" icon -> "Advanced setting" -> "Configure Resources",
modify the CPU/memory parameters, and redeploy the blueprint.
5. Increase node resources for the cluster. Log in to the Portaladmin page, open the
"Environment" -> "Business Cluster" page, click the cluster name -> "Nodes" ->
"Scale out", fill in all necessary parameters, and click the "Scale out" button.
6. Check whether the Pod affinity matches the Pod label. Log in to the Portaladmin
page and open "Environment" -> "Business Cluster" -> cluster name -> "Node".
Yes → Go to step 8.
No → Go to step 7.
7. Modify the node affinity configuration of the application and redeploy the application.
Open the "AppManager" page and click the "Delete" button to delete this application.
Open the "Software Repository" -> "Image" page, find the image used by the
application, click "Deploy" -> "Show Advanced Settings" -> "affinity config", fill in the
necessary parameters, and then deploy the image. Alternatively, open the "Software
Repository" -> "Blueprint" page, find the blueprint used by the application, click
"Deploy" -> "Show Advanced Settings", fill in the necessary parameters, and click
"Deploy".
8. Open the "AppManager" page and check whether all applications that have an
affinity/anti-affinity relationship with this application are correct.
Yes → Please contact ZTE technical support.
No → Go to step 9.
9. Modify the node affinity configuration of the application and redeploy the application.
Open the "AppManager" page and click the "Delete" button to delete this application.
Open the "Software Repository" -> "Image" page, find the image used by the
application, click "Deploy" -> "Show Advanced Settings" -> "affinity config", fill in the
necessary parameters, and then deploy the image. Alternatively, open the "Software
Repository" -> "Blueprint" page, find the blueprint used by the application, click
"Deploy" -> "Show Advanced Settings", fill in the necessary parameters, and click
"Deploy".
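Where command-line access to the business cluster is available, steps 2, 3, and 6 can also be cross-checked with kubectl; this is a minimal sketch in which the namespace and app label are placeholders.

```shell
# Cross-check node capacity, scheduling events, and node labels with
# kubectl. NS and APP are placeholders; cluster access is assumed.
NS="${NS:-default}"
APP="${APP:-my-app}"

if command -v kubectl >/dev/null 2>&1; then
    # Step 2: allocated vs. allocatable resources on each node
    kubectl describe nodes | grep -A 5 "Allocated resources"
    # Steps 3/6: a Pending pod's events state whether resources or
    # affinity rules prevented scheduling
    kubectl describe pod -n "$NS" -l app="$APP" | grep -A 10 "Events"
    # Step 6: node labels that the pod's affinity terms must match
    kubectl get nodes --show-labels
else
    echo "kubectl not found; run this on a node with cluster access"
fi
```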
Alarm Cause
Action
Log in to the Portaladmin page, select "Monitor" -> "Alarm" -> "Current Alarm", and
check whether there is a "cluster status abnormal" alarm.
Yes → Handle the alarm based on the suggestion of the "cluster status abnormal" alarm.
No → Please contact ZTE technical support.
Alarm Cause
Action
On the platform O&M portal, select Monitor > Alarm > Current Alarm. Check whether
the “Abnormal cluster status” alarm exists.
Yes -> Handle the fault based on related handling suggestions.
No -> Contact ZTE technical support.
Alarm Cause
Action
1. On the “Event” tab of the application details page, filter the searched events by
“AppRunAbnormally”.
If it displays “could not run normally in appointed time” in the event description, go
to Step 2.
If it displays “select cluster fail” in the event description, go to Step 3.
If it displays “resource of tenant is not enough” in the event description, this
indicates the resource quota of the tenant is insufficient. Contact the platform
administrator.
For other displayed information, go to Step 4.
2. On the “Current Alarm” tab of the application details page, view the alarms.
If the “Kubernetes Failed to Dispatch Pod” alarm exists, handle the fault based on
related handling suggestions.
If the “Pod Network Configuration Failure” alarm exists, handle the fault based on
related handling suggestions.
If the “Failed to Create a Pod” alarm exists, handle the fault based on related
handling suggestions.
If the “Failed to Mount a Volume to the Pod” alarm exists, handle the fault based
on related handling suggestions.
If the above alarms do not exist, go to Step 4.
3. On the platform O&M portal, select Environment > Business Cluster. View the
information in the “Available status” column.
It displays “Yes” -> Go to Step 4.
It displays “No” -> On the platform O&M portal, select Monitor > Alarm >
Current Alarm. If the “Abnormal cluster status” alarm exists, handle the fault
based on related handling suggestions.
4. Attempt to analyze the cause of the failure according to the details of the
AppRunAbnormally event.
Clearly describe the cause of the failure and contact the platform administrator to
fix the failure.
If the cause of the failure is unclear, contact ZTE technical support.
Alarm Cause
Action
1. On the “Event” tab of the application details page, filter the searched events by
“AppRunAbnormally”.
If it displays “could not run normally in appointed time” in the event description, go
to Step 2.
If it displays “select cluster fail” in the event description, go to Step 3.
If it displays “resource of tenant is not enough” in the event description, this
indicates the resource quota of the tenant is insufficient. Contact the platform
administrator.
Alarm Cause
In the underlay scenario, the PaaS network components failed to create network ports
because the port resource quota of IaaS is insufficient. The PaaS network lacks the
network specified in the Pod blueprint.
Action
1. Contact the IaaS administrator to modify the resource quota configuration of the
IaaS tenant used by PaaS.
2. Check whether the PaaS network has created the network planned for use in the
Pod blueprint. Open the Portaladmin system -> "Resources" -> "Network" page and
check whether the network is created. If not, click "Create Network" to add a new one.
Alarm Cause
Action
Alarm Cause
Download blueprint failed, Create PVC failed, Create IPGroup failed, Deploy pdm/vnpm
server failed, Deploy broker failed.
Action
1. If the detail information of the alarm is "download BluePrint failed", check "Software
Repository" -> "Blueprint" and, according to the deployed common service name and
version number, check whether the corresponding common service blueprint exists.
No → Please contact the administrator to upload the blueprint version.
Yes → Please contact the administrator to confirm whether the software
repository is normal.
2. If the detail information of the alarm is "create PVC failed", check the shared storage
node. Please contact the administrator to confirm whether the environment has
storage clusters or whether the volume capacity resources are exhausted.
3. If the detail information of the alarm is "NW create ipgroup failed", check the network.
Please contact the administrator to confirm the network.
4. If the detail information of the alarm is "vnpm deploy server failed", check the events
from the VNPM. Check the "Monitor" -> "Alarm" -> "Current Alarm" page to see
whether there is a "cluster status abnormal" alarm.
Yes → Click the alarm name and view "detail information" -> "Suggestion" to handle
the alarm.
No → Go to step 6.
5. If the detail information of the alarm is "vnpm deploy broker failed", check the events
from the VNPM, following the same operation as in step 4.
6. Please contact ZTE technical support.
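For step 2, the PVC state can also be inspected directly when kubectl access to the cluster is available; the namespace here is a placeholder.

```shell
# Step 2: a PVC stuck in Pending usually means no matching storage
# class or exhausted volume capacity. NS is a placeholder namespace.
NS="${NS:-default}"

if command -v kubectl >/dev/null 2>&1; then
    kubectl get pvc -n "$NS"
    # The events explain why binding failed (capacity, storage class, ...)
    kubectl describe pvc -n "$NS" | grep -A 5 "Events"
else
    echo "kubectl not found; run this on a node with cluster access"
fi
```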
Alarm Cause
The master zone failed to send data to the slave zone, probably due to one of the
following factors:
A network failure.
Network congestion.
The peer entity in the slave zone is offline.
Action
Contact ZTE technical support to check the component status and network configurations
for disaster recovery, and repair them.
Alarm code: 9121
Alarm description:
This alarm is generated when the FTP disk space usage is excessive.
As the FTP disk space usage increases, the alarm level is adjusted dynamically.
The default thresholds are as follows:
→ When the usage rate reaches 70%, a major alarm is reported. At this point the FTP
function is not actually affected; the alarm reminds users to clean up the disk
space in time. The alarm is cleared when the usage drops below 60%.
→ When the usage rate reaches 90%, a critical alarm is reported. At this point the
FTP server is set to read-only: files can only be viewed, downloaded, and deleted,
not uploaded. Clean up the FTP disk space immediately. When the usage rate drops
below 80%, the alarm level changes back to major and the FTP server becomes
writable again.
The above thresholds can be customized by users on the FTP administration page.
Alarm level: critical
Alarm type: QoS Alarm
Alarm Cause
Action
Contact ZTE technical support to clean up unused data on the FTP server.
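The documented 70%/90% default thresholds amount to the following check; this is a minimal sketch in which the FTP directory path is a placeholder, and the real thresholds may have been customized on the FTP administration page.

```shell
# Map current disk usage of the FTP directory to the documented default
# alarm levels. FTP_DIR is a placeholder path.
FTP_DIR="${FTP_DIR:-/}"

# POSIX df -P keeps each filesystem on one line; column 5 is "Capacity"
USED=$(df -P "$FTP_DIR" | awk 'NR == 2 { gsub(/%/, ""); print $5 }')
if [ "$USED" -ge 90 ]; then
    echo "critical: ${USED}% used, FTP is set to read-only"
elif [ "$USED" -ge 70 ]; then
    echo "major: ${USED}% used, clean up FTP disk space in time"
else
    echo "ok: ${USED}% used"
fi
```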
Alarm code: 5101
Alarm description: This alarm is reported when a shared volume or local disk attached
to the FTP service cannot be read or written properly.
Alarm level: critical
Alarm type: QoS Alarm
Alarm Cause
Action
Alarm code: 9201
Alarm description: This alarm is generated when project creation fails.
Alarm level: Major
Alarm type: QoS Alarm
Alarm Cause
Action
Alarm code: 9301
Alarm description: This alarm is reported when the connection for geographical disaster
recovery is broken.
Alarm level: critical
Alarm type: QoS Alarm
Alarm Cause
Action
Contact ZTE technical support to check the network configurations for disaster recovery
and repair them.
Alarm Cause
See Details.
Action
Alarm Cause
See Details.
Action