ElasticNet UME R18 Alarm Handling
Version: V16.19.40
ZTE CORPORATION
No. 55, Hi-tech Road South, Shenzhen, P.R. China
Postcode: 518057
Tel: +86-755-26771900
URL: http://support.zte.com.cn
E-mail: support@zte.com.cn
LEGAL INFORMATION
Copyright 2020 ZTE CORPORATION.
The contents of this document are protected by copyright laws and international treaties. Any reproduction
or distribution of this document or any portion of this document, in any form by any means, without the
prior written consent of ZTE CORPORATION is prohibited. Additionally, the contents of this document
are protected by contractual confidentiality obligations.
All company, brand and product names are trade or service marks, or registered trade or service marks,
of ZTE CORPORATION or of their respective owners.
This document is provided "as is", and all express, implied, or statutory warranties, representations, or
conditions are disclaimed, including without limitation any implied warranty of merchantability, fitness for
a particular purpose, title or non-infringement. ZTE CORPORATION and its licensors shall not be liable
for damages resulting from the use of or reliance on the information contained herein.
ZTE CORPORATION or its licensors may have current or pending intellectual property rights or
applications covering the subject matter of this document. Except as expressly provided in any written
license between ZTE CORPORATION and its licensee, the user of this document shall not acquire any
license to the subject matter herein.
ZTE CORPORATION reserves the right to upgrade or make technical changes to this product without
further notice.
Users may visit the ZTE technical support website http://support.zte.com.cn to inquire about related
information.
Revision History
2.19 0050 Business performance threshold exceeded.............................................2-18
2.20 0052 Middleware performance threshold exceeded......................................... 2-19
2.21 0053 Framework performance threshold exceeded......................................... 2-19
2.22 0054 self-maintain performance threshold exceeded....................................... 2-20
2.23 1513 Performance threshold alarm...................................................................2-20
2.24 3001 Abnormal Minion Status...........................................................................2-21
2.25 3002 Cluster status abnormal...........................................................................2-22
2.26 4002 Insufficient Tenant Quota......................................................................... 2-23
2.27 5001 Certificate Will Expire Soon..................................................................... 2-24
2.28 5002 Certificate Expired....................................................................................2-24
2.29 5101 Create Project Failed............................................................................... 2-24
2.30 5102 Delete Project Failed............................................................................... 2-25
2.31 7001 Storage cluster status abnormal.............................................................. 2-26
2.32 7002 Cluster Capacity Usage Exceeded the Threshold................................... 2-27
2.33 7003 Volume Capacity Usage Exceeded the Threshold...................................2-27
2.34 8501 NBM Initialization Failed.......................................................................... 2-28
2.35 9101 PostgreSQL database cluster unavailable...............................................2-28
2.36 9102 PostgreSQL database cluster contains unavailable nodes...................... 2-29
2.37 9103 PostgreSQL database cluster replication interrupts or produces brain-split....... 2-30
2.38 9104 PostgreSQL database master and standby cluster replication interruption.........2-31
2.39 9105 PostgreSQL database failed to archive log file........................................2-32
2.40 9141 Index is damaged in common service Elasticsearch............................... 2-33
3 OMC Alarm.................................................................................................3-1
3.1 1000 User locked.................................................................................................. 3-2
3.2 1001 Hard disk usage of database server overload............................................. 3-2
3.3 1002 CPU usage of application server overload.................................................. 3-3
3.4 1003 RAM usage of application server overload..................................................3-3
3.5 1004 Application server disk-overload.................................................................. 3-3
3.6 1008 Database instance space usage too large.................................................. 3-4
3.7 1012 License is expired........................................................................................3-4
3.8 1013 License is about to expire........................................................................... 3-5
3.9 1015 The link between the server and the ME agent is broken........................... 3-5
3.10 1017 The time in which the designated alarm remains active has expired......... 3-6
3.11 1018 The time in which the designated alarm remains unacknowledged has expired.... 3-6
3.12 1022 Merge rule root alarm................................................................................ 3-7
3.13 1023 Suppress plan task.................................................................................... 3-7
3.14 1025 Automatic backup failure........................................................................... 3-8
3.15 1028 Alarm forwarding failure.............................................................................3-8
3.16 1034 License consumption exceeds the alarm threshold................................... 3-9
3.17 1035 License consumption exceeds the total authorization............................... 3-9
3.18 1050 Wrong login password............................................................................... 3-9
3.19 1060 The number of users assigned the specific type exceeds the limit.......... 3-10
3.20 1061 The number of users assigned the specific type is about to exceed the limit...... 3-10
3.21 1300 Password has expired............................................................................. 3-11
3.22 1301 Password will expire................................................................................ 3-11
3.23 1310 The number of login users exceeds the limit...........................................3-11
3.24 1311 SNMP authentication failure.....................................................................3-12
4 Communication Alarm.............................................................................. 4-1
4.1 1014 The link between the server and the ME is broken..................................... 4-1
4.2 1040 ME or agent backend start failure............................................................... 4-1
4.3 200204012 S1 link is broken................................................................................ 4-2
4.4 200204013 Power supply failure.......................................................................... 4-2
4.5 200204014 Transport failure................................................................................. 4-3
5 Processing Error Alarm............................................................................5-1
5.1 0502 K8s schedule failed..................................................................................... 5-1
5.2 0503 K8s create pod failed...................................................................................5-3
5.3 0504 Failed to Delete a Pod................................................................................ 5-3
5.4 1014 Abnormal Service Operational Status..........................................................5-4
5.5 1015 Abnormal Microservice Operational Status................................................ 5-5
5.6 2001 Add network for Pod error........................................................................... 5-6
5.7 2002 IaaS account authentication failed...............................................................5-7
5.8 8001 Commonservice deployed failed..................................................................5-7
5.9 9302 Failed to synchronize data to slave zone.................................................... 5-8
6 Environment Alarm................................................................................... 6-1
6.1 9121 FTP disk space is insufficient...................................................................... 6-1
6.2 9122 FTP disk read and write exception..............................................................6-2
6.3 9201 Common Service Kafka node is offline....................................................... 6-2
6.4 9301 The connection for geographical disaster recovery is broken......................6-3
7 Integrity Violation Alarm...........................................................................7-1
7.1 15010001 Alarm for Missing of PM Data............................................................. 7-1
7.2 15010002 Alarm for Missing of NAF PM Data..................................................... 7-1
Glossary............................................................................................................. I
About This Manual
Purpose
The ElasticNet UME R18 (hereinafter referred to as the UME) is a RAN element
management system.
This manual provides a reference for alarms related to the UME system. For alarms
related to a specific NE, refer to the corresponding user manual of the NE.
Intended Audience
Chapter 1, Equipment Alarm: Provides a reference for equipment alarms related to the UME system.
Chapter 2, QoS Alarm: Provides a reference for QoS alarms related to the UME system.
Chapter 3, OMC Alarm: Provides a reference for network management alarms related to the UME system.
Chapter 4, Communication Alarm: Provides a reference for communication alarms related to the UME system.
Chapter 5, Processing Error Alarm: Provides a reference for processing error alarms related to the UME system.
Chapter 6, Environment Alarm: Provides a reference for environment alarms related to the UME system.
Chapter 7, Integrity Violation Alarm: Provides a reference for integrity violation alarms related to the UME system.
Related Documentation
ElasticNet UME R18 Unified Management Expert System Alarm Management Operation
Guide
Conventions
Chapter 1
Equipment Alarm
Table of Contents
0001 Container Startup Failed.......................................................................................1-1
0011 Node Heartbeat lost..............................................................................................1-2
0051 Hardware performance threshold exceeded........................................................ 1-3
0069 Hardware Alarm....................................................................................................1-4
3003 Component Instance State Exception.................................................................. 1-4
7004 NFS Shared Volume Multi-Mounted In The Cluster............................................. 1-5
7005 Read-Only NFS Shared Volume.......................................................................... 1-5
9031 One Port of the Bond Group Fault....................................................................... 1-5
9032 All Ports of the Bond Group Fault........................................................................ 1-6
9033 OVS Service Fault................................................................................................ 1-6
9321 Platform pg database is unusable........................................................................1-7
9322 Platform pg node instance is abnormal..............................................1-7
9323 pacemaker cluster heartbeat is abnormal............................................................ 1-8
Alarm Cause
Action
1. On the Application Manager page, click an application to enter the details page.
Click the Alarm tab. View the current alarms and check whether there is the “Pod
network configuration failure” alarm.
a. Yes -> Handle the fault based on related handling suggestions.
b. No -> Step 2.
2. If this application is not deployed through a blueprint, go to Step 5.
3. If this application is deployed through a blueprint, check whether the blueprint
container image is correct.
4. Select Software Repository > Blueprint , and click a blueprint used by the
application. Click the used blueprint version. In the Action column, select Edit to
enter the blueprint editing page. Check whether the Pod container image name and
version number exist in the image repository.
a. No -> Modify the Pod container image and save it. In the Action column, click
Deploy to re-deploy the application, and delete the original application.
b. Yes -> Step 5.
5. Check whether the application itself is abnormal.
6. On the Application Manager page, click Application Name to enter the microservice
page. Click Microservice Name to enter the Pod page. Click Pod Name to enter the
container page. Click the Container tab. Click Container Name to enter the container
details page. Select the Log tab. Check whether the application is abnormal in
accordance with the container logs. If there are no logs or you cannot determine the
cause, contact ZTE technical support.
Alarm Cause
Action
1. Log in to the PaaS Controller node, and execute ssh ubuntu@IP address of the
control node.
a. Successful login -> Step 2.
b. Login failure -> Step 4.
2. Log in to the abnormal node. Assume that the node name is default-np-5-192.173.0.57.
a. You can log in to the node by executing ssh ubuntu@192.173.0.57.
b. Successful login -> Step 3.
c. Login failure -> Step 4.
3. Restart the heartbeat handshaking component.
a. Switch to the root user: sudo su.
b. Execute the service heartbeat restart command. Wait for 5 minutes, and check
whether the alarm is cleared.
c. Yes -> End.
d. No -> Step 4.
4. Restart the abnormal node.
a. In a non-preset scenario, select Resources > Compute > Nodes . In the search
box, enter '192.173.0.57' as the alarm object name to find the abnormal node.
Click the restart button of the node.
b. In a preset scenario, switch to the root user (sudo su), and execute reboot to
restart the abnormal node.
5. Check whether the alarm is cleared.
a. Yes -> End.
b. No -> Contact ZTE technical support.
Alarm Cause
The hardware performance indicator exceeds the threshold within the specified
inspection period.
Action
Alarm Cause
Action
Check the running status of the hardware equipment according to the alarm location
information. If there are no logs or the cause cannot be determined, contact ZTE
technical support.
Alarm Cause
Action
If the alarm is not cleared within 10 minutes, contact ZTE technical support.
Alarm Cause
Action
Alarm Cause
Action
Alarm Cause
Action
Check the specific causes of the 9031 alarm, and then have maintenance personnel
repair the faulty Ethernet interface.
Alarm Cause
Action
Check the specific causes of the 9032 alarm, and then have maintenance personnel
repair the faulty Ethernet interface.
Alarm Cause
Action
Alarm Cause
Action
If the alarm is not cleared within 10 minutes, contact ZTE technical support to repair
the pg database failure.
Alarm Cause
Action
If the alarm is not cleared within 20 minutes, contact ZTE technical support to repair
the pg database failure.
Alarm Cause
Action
If the alarm is not cleared within 20 minutes, contact ZTE technical support to repair
the pacemaker cluster node failure.
Alarm Cause
Action
2. Check whether the CPU usage trend of the node is consistent with the service status
based on the analysis of the on-site services.
Yes -> Step 3.
No -> Contact ZTE technical support.
3. Ask the project administrator of the service to check whether the service status is
normal.
Yes -> Step 4.
No -> Solve the problems with the service.
4. Determine whether to increase the CPU usage threshold in accordance with the
service status.
If the traffic increases sharply in a short time, modification is not recommended.
When the traffic decreases to normal level, the alarm is cleared automatically.
If the traffic increases over a long period, it is suggested to adjust the CPU usage
QoS threshold.
5. Select Settings > Alarm management . Click the QoS Manage tab and unfold the
node. Click the Modify button in the CPU Usage Rate line, and then modify the
thresholds at all levels.
Alarm Cause
Action
a. Log in to the Portal admin web UI, select Monitor > Alarm > Current Alarm , and
click the alarm name to enter the detail page.
b. Click Go to check to enter the node information page.
c. Click History Performance , and select a time tab to check the memory usage.
2. Based on the analysis of the on-site business, confirm whether the Memory usage is
consistent with the business status.
Yes → Go to step 3.
No → Contact ZTE technical support.
3. Confirm whether the service status is normal.
Yes → Go to step 4.
No → Solve the problems with the service.
4. Determine whether to increase the memory usage threshold in accordance with the
service status.
If the traffic increases sharply in a short time, modification is not recommended.
When the traffic decreases to normal level, the alarm is cleared automatically.
If the traffic increases over a long period, it is suggested to adjust the memory usage
QoS threshold and go to Step 5.
5. Select Settings > Alarm management . Click the QoS Manage tab and unfold the
node. Click the Modify button in the Memory Usage Rate line, and then modify the
thresholds at all levels.
Alarm Cause
Action
Alarm Cause
Action
Alarm Cause
Action
Alarm Cause
Action
Alarm Cause
Action
Alarm Cause
Action
Alarm Cause
The disk partition usage of the node exceeds the QoS threshold.
Action
4. Determine whether to increase the disk partition usage threshold in accordance with
the service status.
If the traffic increases sharply in a short time, modification is not recommended.
When the traffic decreases to normal level, the alarm is cleared automatically.
If the traffic increases over a long period, it is suggested to adjust the disk usage QoS
threshold and go to Step 5.
5. Select Settings > Alarm management . Click the QoS Manage tab and unfold the
disk partition. Click the Modify button in the Disk Partition Usage Rate line, and then
modify the thresholds at all levels.
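Before deciding in step 4 whether to raise the threshold, it helps to confirm which partition is actually over the limit. The following is a minimal sketch, assuming `df -P` output columns and an illustrative 80% threshold; the real limit is whatever is configured under QoS Manage:

```shell
# Sample `df -P` style output; on a real node, pipe `df -P` in directly.
dfout='Filesystem 1024-blocks Used Available Capacity Mounted-on
/dev/sda1 102400 92160 10240 90% /
/dev/sda2 102400 10240 92160 10% /data'

# Print every mount point whose usage is at or above the illustrative 80% threshold.
over=$(printf '%s\n' "$dfout" | awk 'NR > 1 { gsub("%", "", $5); if ($5 + 0 >= 80) print $6 }')
echo "$over"
```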
2.10 0030 The time offset from the NTP server is too large
Alarm Information
Alarm Cause
Action
b. Start the NTP service of the node with the command “systemctl start
ntpd.service”.
c. Use the command “systemctl status ntpd.service” to check whether the NTP
service of the node is successfully started (the status is active).
3. If the NTP server is normal, the NTP service of the node may be abnormal. Please
contact technical support.
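The offset check behind this alarm can be sketched as a small test on a sampled offset value. The 128 ms limit below is an assumption for illustration only; the product compares against its own configured threshold:

```shell
# Return success (exit 0) when the absolute offset, in milliseconds, exceeds the limit.
offset_too_large() {
    awk -v o="$1" 'BEGIN { if (o < 0) o = -o; exit (o >= 128) ? 0 : 1 }'
}

if offset_too_large 512.0; then
    echo "offset too large: restart the NTP service (systemctl restart ntpd.service)"
else
    echo "offset within limit"
fi
```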
Alarm Cause
The CPU Iowait usage of the node exceeds the QoS threshold.
Action
3. Ask the project administrator of the service to check whether the service status is
normal.
Yes -> Step 4.
No -> Solve the problems with the service.
4. Determine whether to increase the CPU Iowait Usage threshold in accordance with
the service status.
If the traffic increases sharply in a short time, modification is not recommended.
When the traffic decreases to normal level, the alarm is cleared automatically.
If the traffic increases over a long period, it is suggested to adjust the CPU Iowait
usage QoS threshold and go to Step 5.
5. Select Settings > Alarm management . Click the QoS Manage tab and unfold the
node. Click the Modify button in the CPU Iowait Usage Rate line, and then modify the
thresholds at all levels.
Alarm Cause
The CPU usage of the Steal process of the node exceeds the QoS threshold.
Action
c. Click the History Performance tab. Select a time period and view the trend graph
of CPU Steal Usage Rate.
2. Check whether the CPU usage trend of the Steal process is consistent with the
service status based on the analysis of the on-site services.
Yes -> Step 3.
No -> Contact ZTE technical support.
3. Ask the project administrator of the service to check whether the service status is
normal.
Yes -> Step 4.
No -> Solve the problems with the service.
4. Determine whether to increase the CPU Steal Usage threshold in accordance with
the service status.
If the traffic increases sharply in a short time, modification is not recommended.
When the traffic decreases to normal level, the alarm is cleared automatically.
If the traffic increases over a long period, it is suggested to adjust the CPU Steal
usage QoS threshold and go to Step 5.
5. Select Settings > Alarm management . Click the QoS Manage tab and unfold the
node. Click the Modify button in the CPU Steal Usage Rate line, and then modify the
thresholds at all levels.
Alarm Cause
The system PID usage of the node exceeds the QoS threshold.
Action
Alarm Cause
Action
2.15 0036 Node can not synchronize time with NTP server
Alarm Information
Alarm Cause
Action
1. On the platform O&M portal, select Monitor > Alarm > Current Alarm . If it displays
“NTP daemon exit” or “NTP offset high”, refer to the alarm handling suggestions.
Otherwise, go to Step 2.
2. Check whether the NTP service of this node is normal.
a. Log in to the alarm node through SSH and switch to the root user.
b. Check whether the NTP service is running normally:
systemctl status ntpd.service
Check if the service is active. If it is not active, please contact the administrator.
3. Check whether the network between the node and the NTP server is connected.
a. Log in to the alarm node through SSH and switch to the root user.
b. Use the ping command to check whether the network between the node and the
NTP server is connected. If there are multiple servers, ping them one by one. If
the ping fails, solve the network problem first.
cat /etc/ntp.conf |grep "^server" |grep -v "127.127.1.0"
server 10.30.1.105 minpoll 3 maxpoll 4
#In the example, 10.30.1.105 is the NTP server.
ping 10.30.1.105
4. Contact the administrator to confirm that the NTP server is normal. If it is not normal,
first solve the problem of the NTP server, and then observe whether the alarm is
restored.
5. Contact ZTE technical support.
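The grep filter shown in step 3 can be wrapped into a short sketch that lists the servers to ping, excluding the local clock entry 127.127.1.0 as the manual does. The configuration content below is a sample, not read from a real node:

```shell
# Sample /etc/ntp.conf content; on the node you would read the real file instead.
conf='server 10.30.1.105 minpoll 3 maxpoll 4
server 127.127.1.0'

# Keep only real upstream servers, then print the address column.
servers=$(printf '%s\n' "$conf" | grep '^server' | grep -v '127.127.1.0' | awk '{print $2}')
echo "$servers"    # ping each address listed here one by one
```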
The network rx rate of the node is too high, and alarms of different levels are raised
dynamically:
→ When the rate reaches 300000000Bps, a warning alarm is raised.
→ When the rate reaches 500000000Bps, a minor alarm is raised.
→ When the rate reaches 750000000Bps, a major alarm is raised.
→ When the rate reaches 900000000Bps, a critical alarm is raised.
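The dynamic level mapping above can be expressed as a small helper function. The thresholds are copied from the list; the function itself is only an illustration, not part of the product:

```shell
# Map a network rx rate in Bps to the alarm severity raised at that rate.
rate_to_level() {
    if   [ "$1" -ge 900000000 ]; then echo critical
    elif [ "$1" -ge 750000000 ]; then echo major
    elif [ "$1" -ge 500000000 ]; then echo minor
    elif [ "$1" -ge 300000000 ]; then echo warning
    else echo none
    fi
}

rate_to_level 820000000    # in the 750000000-900000000 band -> major
```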
Alarm level: Undefined
Alarm type: QoS Alarm
Alarm Cause
Action
Alarm Cause
Action
If the traffic increases over a long period, it is suggested to adjust the QoS threshold
of the network tx rate and go to Step 5.
5. Select Settings > Alarm management . Click the QoS Manage tab and unfold the
node. Click the Modify button in the Network Tx Rate line, and then modify the
thresholds at all levels.
Alarm Cause
The time difference between the node and the NTP server is too large.
The NTP service is abnormal.
Action
1. Check the time difference between the node and the clock source to determine
whether it is because the time difference is too large. For the check method and
processing method, refer to the handling suggestions of the alarm “The time offset
from the NTP server is too large”.
2. If the time difference is too large, contact the administrator. If it is for other reasons,
contact technical support.
Alarm Cause
The business performance indicator exceeds the threshold within the specified
inspection period.
Action
Alarm Cause
The middleware performance indicator exceeds the threshold within the specified
inspection period.
Action
Alarm Cause
The framework performance indicator exceeds the threshold within the specified
inspection period.
Action
Alarm Cause
The self-maintain performance indicator exceeds the threshold within the specified
inspection period.
Action
Alarm Cause
None.
Action
Alarm Cause
Action
b. After the alarm 2001 is cleared, check whether this alarm is cleared. Yes ->
End. No -> go to Step 3.
3. Contact the administrator.
If the alarm is “Minion is absent”:
1. Delete the node from the cluster.
a. The value of the object ID in the alarm information is the node’s uuid.
b. View the uuid of the home cluster in the “Extra Params” in the “Detail
Information” box.
c. Delete the node from the cluster by using the command.
Log in to the control node, and switch to the root user. Enter the command to
delete the node.
sudo su
cluster delete <cluster_uuid> node <node_uuid>
d. After the node is deleted, check whether the alarm is cleared. No -> go to
Step 2.
2. Contact the administrator.
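Because `cluster delete` is destructive, it is worth sanity-checking the uuids copied from the alarm before running it. The sketch below assumes the usual lowercase RFC 4122 textual uuid form; the pattern and example uuid are illustrative only:

```shell
# Return success when the argument looks like a lowercase RFC 4122 uuid.
is_uuid() {
    printf '%s' "$1" | grep -Eq '^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$'
}

if is_uuid "123e4567-e89b-12d3-a456-426614174000"; then
    echo "uuid looks valid; safe to build: cluster delete <cluster_uuid> node <node_uuid>"
else
    echo "uuid malformed: re-check the Object ID in the alarm information"
fi
```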
Alarm Cause
More than half of the control nodes in the cluster cannot provide services.
There is no available working node in the cluster.
Action
1. On the platform O&M portal, select Environment > Business Cluster . The Detail page
is displayed. Click the Node tab and view the nodes and roles under the cluster.
If the minion node does not exist in the cluster, go to Step 2.
If the minion node exists in the cluster, go to Step 3.
2. Expand the capacity of the cluster and add the minion node.
a. Click the Scale out button on the page described in Step 1, and add the minion
node.
b. Wait for the minion node to be deployed, and check whether this alarm is cleared.
Yes -> End.
3. On the platform O&M portal, select Monitor > Alarm > Current Alarm . Check whether
the “Abnormal cluster node status” alarm exists.
Yes -> go to Step 4.
No ->Contact ZTE technical support.
4. View the additional information of the “Abnormal cluster node status” alarm and
check whether the home cluster of the node is the cluster that raises the alarm.
Yes -> Step 5.
No -> Contact ZTE technical support.
5. For each node in the cluster that raises the “Abnormal cluster node status” alarm,
follow the alarm handling suggestions. Wait until the “Abnormal cluster node status”
alarm is cleared, and check whether this alarm is cleared.
Yes -> End.
No -> Contact ZTE technical support.
Alarm Cause
The remaining disk quota of the tenant is less than 10% of the total quota.
Action
Alarm code: 5001
Alarm description: The user’s certificate file will expire soon.
Alarm level: Major
Alarm type: QoS Alarm
Alarm Cause
Action
Select Settings > Cert Manager , and check the certificate files according to the alarm
information. Then click the Update button, and update the certificate according to the
page tips.
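Certificates are updated through the portal, but the remaining validity of a certificate file can also be checked with openssl. The sketch below generates a throwaway self-signed certificate purely for the demonstration; the 30-day warning window is illustrative, not the product's actual threshold:

```shell
# Create a scratch directory and a 60-day self-signed certificate for the demo.
dir=$(mktemp -d)
openssl req -x509 -newkey rsa:2048 -keyout "$dir/key.pem" -out "$dir/cert.pem" \
    -days 60 -nodes -subj "/CN=demo" 2>/dev/null

# -checkend succeeds only if the certificate is still valid after the given seconds.
if openssl x509 -checkend $((30 * 24 * 3600)) -noout -in "$dir/cert.pem" >/dev/null; then
    result="certificate valid for at least 30 more days"
else
    result="certificate expires within 30 days: update it via Settings > Cert Manager"
fi
echo "$result"
```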
Information,and then click Update button,Update the certificate acoording to page tips.
Alarm Cause
The certificate has expired.
Action
1. Select Projects & Users > Projects , check the cause of the project creation failure,
and click Retry to re-create the project.
2. If you cannot resolve the problem, contact ZTE technical support.
Alarm code: 5101
Alarm description: Creating a project failed.
Alarm level: Critical
Alarm Cause
Action
1. Select Projects & Users > Projects , check the cause of the project creation failure,
and click Retry to re-create the project.
2. If you cannot resolve the problem, contact ZTE technical support.
Alarm Cause
Action
1. On the platform O&M portal, select Resources > Storage . Click Built-in storage
cluster , and check whether the Status column is healthy .
Yes -> Contact ZTE technical support.
No -> Step 2.
2. Wait five minutes and check whether the storage cluster heals itself. After five
minutes, check whether the Status column is healthy .
Yes -> End.
No -> Step 3.
3. If the Status column of a cluster is unhealthy , click the cluster name to enter the
storage cluster details page. Observe the Node information list.
If the Status column of each node is normal, contact ZTE technical support.
Alarm Cause
Action
1. On the platform O&M portal, select Resources > Storage . Click Built-in storage
cluster , and check whether the Status column is healthy .
Yes -> Contact ZTE technical support.
No -> Step 2.
2. Wait five minutes and check whether the storage cluster heals itself. After five
minutes, check whether the Status column is healthy .
Yes -> End.
No -> Step 3.
3. If the Status column of a cluster is unhealthy , click the cluster name to enter the
storage cluster details page. Observe the Node information list.
If the Status column of each node is normal, contact ZTE technical support.
If the Status column of a node is abnormal, go to Step 4.
4. If the Status column of a node is abnormal, click the Restart button in the Action
column of the node to manually complete the forced recovery of the storage cluster.
Wait five minutes and observe the Status column of the node. If it is normal and the
Status column of the corresponding storage cluster is healthy , the alarm handling is
completed. Otherwise, contact ZTE technical support.
Alarm Cause
The used capacity of the cluster exceeds 80% of the total capacity.
Action
Alarm Cause
The used capacity of the volume exceeds 80% of the total capacity.
Action
If not, ask the platform administrator to add a storage device to expand the storage
volume capacity.
Alarm Cause
Action
Contact ZTE technical support to check whether the RabbitMQ service is normal.
Alarm Cause
A node whose PG status is LATEST or SYNC fails to start for some reason. PGs in
other states can start normally but do not have the right to be promoted, so the whole
cluster cannot elect a master node.
Action
First, check the startup log of the PG whose status is LATEST or SYNC, or start the PG
manually through a PSQL client, to find the cause of the startup failure.
If it cannot start at all, the only option is to start a PG whose status is not LATEST or
SYNC. Such a PG may contain less data than the LATEST or SYNC one, so forcing it to
start may result in a small amount of data loss.
For other information, contact ZTE technical support.
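The decision described above can be summarized as a tiny status-to-action map. The status strings come from the alarm details; the action wording is only a condensed restatement of the text, not product output:

```shell
# Map the PG status of the failed node to the recommended recovery step.
pg_action() {
    case "$1" in
        LATEST|SYNC) echo "check the startup log, or start manually via a PSQL client" ;;
        *)           echo "force-start only as a last resort: a small amount of data may be lost" ;;
    esac
}

pg_action SYNC
pg_action ASYNC
```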
Alarm Cause
Action
1. Use the crm status command to check the status of the cluster. If the master exists
and the standby node starts normally but the stream replication is abnormal:
Through the Self-Management entry, select pg-mng to enter the PG Manager page,
click the problem node, and select pull the data to pull data from the master node
manually.
2. The master exists but the standby node failed to start.
a. First, check the log of the failed PG, or start the PG manually through the PSQL
client, to find the cause of the failure.
b. If the standby node cannot start at all, select the problem node and pull the data
manually through the Self-Management entry > pg-mng PG Manager page.
3. There is no master, and the PG with LATEST or SYNC status did not start successfully.
a. First, check the startup log of the PG whose status is LATEST or SYNC, or start
the PG manually with the PSQL client, to find the cause of the startup failure.
b. If it cannot start at all, restore the database if a backup is available. If restoring is
not an option, the only choice is to start a PG whose status is not LATEST or SYNC.
c. Such a PG may contain less data than the LATEST or SYNC one, so forcing it to
start may result in a small amount of data loss.
Contact ZTE technical support to check whether the service is normal.
Alarm Cause
The master node failed to be promoted, and there was no master node providing
external services.
Action
On the PaaS operation and maintenance interface, select Monitoring -> Alarm -> Details,
and check the specific cause of alarm 9103 in the Details item.
If the cause is shown as "Need People Repair, Streaming is break, You can use
portal to pull full data from Master", the stream replication is broken and the data
needs to be pulled manually through the management interface.
Select Self-Management Entry -> pg-mng, click Enter to open the PG Manager page,
and select the problem node to pull the data manually.
If the cause is shown as "It is possible to have a split brain, keep ban status", the
timeline of the original master node and that of the new master node is the same,
which may indicate a split brain; the data must be compared and merged manually.
If the cause is shown as "Maybe data loss or database abnormality, need Repair,
keep ban status", the difference between the master and standby node data is greater
than the preset value, and the data must be compared and merged manually.
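For the split-brain case, the timeline IDs mentioned above can be read directly from each node's control file; this is a minimal sketch, assuming shell access to a database node, with the data directory path as a placeholder.

```shell
# Sketch: read the checkpoint timeline on a node, to compare the original
# and new master. PG_DATA_DIR is a placeholder for the real data directory.
PG_DATA_DIR="${PG_DATA_DIR:-/var/lib/pgsql/data}"

if command -v pg_controldata >/dev/null 2>&1; then
    # Run this on both nodes; identical timeline IDs match the
    # "possible split brain" condition described above.
    pg_controldata "$PG_DATA_DIR" | grep -i "timeline"
else
    echo "pg_controldata not found; run this on a database node"
fi
```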
Alarm Cause
The stream replication between the disaster recovery cluster and the master cluster is
interrupted, and disaster recovery fails.
Action
Alarm Cause
Action
Error 117 ("full walname is not found"): configure the full WAL path and name.
Alarm code: 9141
Alarm description: When the index status is red in the common service Elasticsearch,
the index is damaged and this alarm is generated.
Alarm level: Undefined
Alarm type: QoS Alarm
Alarm Cause
Action
1. Check whether there are related alarms on the application management page.
a. On the PaaS operation and maintenance interface, select Monitoring -> Alarm ->
Details, and obtain the Object ID from the Details item.
b. On the Application Manager page of the opcs project, find the application named
"commsrves-<Object ID>". Open the Alarm page of the application, and
check whether there is any current alarm.
Yes -> Handle the fault based on related handling suggestions.
No -> Please contact ZTE technical support.
2. For other information, please contact ZTE technical support.
Alarm Cause
None.
Action
Check and analyze the login log to find out whether the problem is caused by a password-
guessing attack. If not, contact the system administrator to unlock the user account.
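As a minimal sketch of the log analysis, assuming a Linux host whose authentication log is in the usual sshd format (the log path and message pattern are assumptions; the platform's own audit log location may differ):

```shell
# Count failed login attempts per source address; a burst of failures from
# a single address suggests a password-guessing attack.
# AUTH_LOG and the "Failed password ... from <ip>" format are assumptions.
AUTH_LOG="${AUTH_LOG:-/var/log/auth.log}"

if [ -r "$AUTH_LOG" ]; then
    grep "Failed password" "$AUTH_LOG" \
        | awk '{for (i = 1; i <= NF; i++) if ($i == "from") print $(i + 1)}' \
        | sort | uniq -c | sort -rn | head
else
    echo "cannot read $AUTH_LOG; check the platform's audit log location"
fi
```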
Alarm Cause
The disk space occupied by audit logs exceeds the threshold "Lower Clean Percent".
The disk space occupied by program logs exceeds the threshold "Lower Clean
Percent".
Action
Alarm Cause
None.
Action
1. Check that the load of the UME is within the allowable range.
2. Check whether any unnecessary applications are running on the UME server. If yes,
exit those unnecessary applications.
Alarm Cause
None.
Action
1. Check that the load of the UME is within the allowed range.
2. Check whether any unnecessary applications are running on the UME server. If yes,
exit those applications to release some RAM.
3. Expand the RAM of the application server.
Alarm Cause
None.
Action
Alarm Cause
None.
Action
Alarm Cause
None.
Action
Alarm Cause
None.
Action
3.9 1015 The link between the server and the ME agent is
broken
Alarm Information
Alarm Cause
None.
Action
Check the link between the server and the agent as follows:
1. On the Alarm Monitor interface, view the details of the alarm to find the information
about the agent whose link is broken. Go to the proxy access UI and check the
proxy address information in the details of the corresponding proxy.
2. Verify whether the network is faulty. If the network is faulty, configure the
firewall accordingly.
3. Otherwise, restart the agent.
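The reachability part of the check can be sketched as follows; AGENT_IP and AGENT_PORT are placeholders to be taken from the proxy details page, and availability of the nc tool on the server is an assumption.

```shell
# Probe the agent from the server side. AGENT_IP/AGENT_PORT are
# placeholders taken from the proxy details page.
AGENT_IP="${AGENT_IP:-192.0.2.10}"
AGENT_PORT="${AGENT_PORT:-8080}"

if command -v ping >/dev/null 2>&1; then
    ping -c 2 -W 2 "$AGENT_IP" >/dev/null 2>&1 \
        && echo "host reachable" \
        || echo "no ICMP reply: host down, or ICMP blocked by a firewall"
fi
if command -v nc >/dev/null 2>&1; then
    nc -z -w 3 "$AGENT_IP" "$AGENT_PORT" 2>/dev/null \
        && echo "agent port open" \
        || echo "agent port closed/filtered: check the firewall, then restart the agent"
fi
```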
Alarm Cause
None.
Action
Alarm Cause
None.
Action
Alarm Cause
None.
Action
1. This alarm is displayed in the current alarm table to show all merged alarms.
2. Find the handling suggestion of each merged alarm by its alarm code, and then
handle the corresponding alarm according to the suggestion.
Alarm Cause
None.
Action
and switchover, users should clear this alarm. If the equipment alarms suppressed by
this alarm are already cleared, they do not need to be handled again. If some equipment
alarms are not cleared yet, users need to check and handle these equipment alarms.
Alarm Cause
None.
Action
Alarm Cause
None.
Action
Alarm Cause
None.
Action
Alarm Cause
None.
Action
Alarm Cause
None.
Action
Alarm Cause
None.
Action
Alarm Cause
None.
Action
Alarm Cause
None.
Action
Alarm Cause
None.
Action
Alarm Cause
None.
Action
The number of login users exceeds the threshold. Please reduce the number of login
users to avoid system abnormalities.
Alarm Cause
None.
Action
4.1 1014 The link between the server and the ME is broken
Alarm Information
Alarm Cause
None.
Action
Do the following to check whether the link between the server and the ME is normal:
1. Find the IP address of the ME on the Topo Management page.
2. Ping the IP address of the ME from the server.
3. If the ping in step 2 fails, solve the network or ME problems.
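The ping check can be scripted so the result is unambiguous; ME_IP is a placeholder for the address found on the Topo Management page.

```shell
# Step 2: ping the ME from the server. ME_IP is a placeholder for the
# address found on the Topo Management page.
ME_IP="${ME_IP:-192.0.2.20}"

if command -v ping >/dev/null 2>&1; then
    if ping -c 3 -W 2 "$ME_IP" >/dev/null 2>&1; then
        echo "ME reachable: the link problem is above the IP layer"
    else
        echo "ME unreachable: check the network path or the ME itself (step 3)"
    fi
else
    echo "ping not available on this server"
fi
```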
Alarm Cause
None.
Action
Alarm Cause
S1 link is broken
Action
Based on the network planning, check whether the settings of the IP address and the
route of each node (such as the BSC, RNC, and switching devices) over the transport
path are correct.
Alarm Cause
Action
Check whether the power supply equipment in the equipment room is normal.
Alarm Cause
Transport failure
Action
Alarm Cause
Action
1. Log in to the URL "http://[ip address]/portaladmin" and open the "Monitor" ->
"Alarm" -> "Current Alarm" tab.
If there is a "K8s Report Node Has Insufficient Memory" or "K8s Report Node Has
Disk Pressure" alarm, handle the alarm according to its suggestion.
If neither exists, go to step 2.
2. Open "Resource" -> "Compute" -> "Nodes", click the Kubernetes node name to
enter the detail page, and open the "Resources Monitor" page to check whether the
CPU/memory requested by the Pod can be satisfied.
Yes → Go to step 6.
No → Go to step 3.
3. Verify whether the CPU/memory resources requested by the application are adjustable.
Yes → Go to step 4.
No → Go to step 5.
4. Adjust the amount of CPU/memory resources requested by the application and
redeploy the application. Log in to the portal page, open "AppManager", select the
application name, and click "Delete".
Open the "Software Repository" -> "Image" page, find the corresponding image, click
the "Deploy" button, and redeploy with the adjusted CPU and memory. Alternatively,
open the "Software Repository" -> "Blueprint" page, click the corresponding blueprint
name, click "Edit" -> "Container" icon -> "Advanced setting" -> "Configure Resources",
modify the CPU/memory parameters, and redeploy the blueprint.
5. Increase node resources for the cluster. Log in to the Portaladmin page, open the
"Environment" -> "Business Cluster" page, click the cluster name -> "Nodes" ->
"Scale out", fill in all necessary parameters, and click the "Scale out" button.
6. Check whether the Pod affinity matches the Pod label. Log in to the Portaladmin
page and open "Environment" -> "Business Cluster" -> cluster name -> "Node".
Yes → Go to step 8.
No → Go to step 7.
7. Modify the node affinity configuration of the application and redeploy the application.
Open the "AppManager" page and click the "Delete" button to delete this application.
Open the "Software Repository" -> "Image" page, find the image used by the
application, click "Deploy" -> "Show Advanced Settings" -> "affinity config", fill in the
necessary parameters, and then deploy the image. Alternatively, open the "Software
Repository" -> "Blueprint" page, find the blueprint used by the application, click
"Deploy" -> "Show Advanced Settings", fill in the necessary parameters, and click
"Deploy".
8. Open the "AppManager" page and check whether all applications that have an
affinity/anti-affinity relationship with this application are correct.
Yes → Please contact ZTE technical support.
No → Go to step 9.
9. Modify the node affinity configuration of the application and redeploy the application.
Open the "AppManager" page and click the "Delete" button to delete this application.
Open the "Software Repository" -> "Image" page, find the image used by the
application, click "Deploy" -> "Show Advanced Settings" -> "affinity config", fill in the
necessary parameters, and then deploy the image. Alternatively, open the "Software
Repository" -> "Blueprint" page, find the blueprint used by the application, click
"Deploy" -> "Show Advanced Settings", fill in the necessary parameters, and click
"Deploy".
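Where command-line access to the business cluster is available, steps 2, 3, and 6 can also be cross-checked with kubectl; this is a minimal sketch in which the namespace and app label are placeholders.

```shell
# Cross-check node capacity, scheduling events, and node labels with
# kubectl. NS and APP are placeholders; cluster access is assumed.
NS="${NS:-default}"
APP="${APP:-my-app}"

if command -v kubectl >/dev/null 2>&1; then
    # Step 2: allocated vs. allocatable resources on each node
    kubectl describe nodes | grep -A 5 "Allocated resources"
    # Steps 3/6: a Pending pod's events state whether resources or
    # affinity rules prevented scheduling
    kubectl describe pod -n "$NS" -l app="$APP" | grep -A 10 "Events"
    # Step 6: node labels that the pod's affinity terms must match
    kubectl get nodes --show-labels
else
    echo "kubectl not found; run this on a node with cluster access"
fi
```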
Alarm Cause
Action
Log in to the Portaladmin page, select "Monitor" -> "Alarm" -> "Current Alarm", and
check whether there is a "cluster status abnormal" alarm.
Yes → Handle the alarm based on the suggestion of the "cluster status abnormal" alarm.
No → Please contact ZTE technical support.
Alarm Cause
Action
On the platform O&M portal, select Monitor > Alarm > Current Alarm. Check whether
the “Abnormal cluster status” alarm exists.
Yes -> Handle the fault based on related handling suggestions.
No -> Contact ZTE technical support.
Alarm Cause
Action
1. On the “Event” tab of the application details page, filter the searched events by
“AppRunAbnormally”.
If it displays “could not run normally in appointed time” in the event description, go
to Step 2.
If it displays “select cluster fail” in the event description, go to Step 3.
If it displays “resource of tenant is not enough” in the event description, this
indicates the resource quota of the tenant is insufficient. Contact the platform
administrator.
For other displayed information, go to Step 4.
2. On the “Current Alarm” tab of the application details page, view the alarms.
If the “Kubernetes Failed to Dispatch Pod” alarm exists, handle the fault based on
related handling suggestions.
If the “Pod Network Configuration Failure” alarm exists, handle the fault based on
related handling suggestions.
If the “Failed to Create a Pod” alarm exists, handle the fault based on related
handling suggestions.
If the “Failed to Mount a Volume to the Pod” alarm exists, handle the fault based
on related handling suggestions.
If the above alarms do not exist, go to Step 4.
3. On the platform O&M portal, select Environment > Business Cluster. View the
information in the “Available status” column.
It displays “Yes” -> Go to Step 4.
It displays “No” -> On the platform O&M portal, select Monitor > Alarm >
Current Alarm. If the “Abnormal cluster status” alarm exists, handle the fault
based on related handling suggestions.
4. Attempt to analyze the cause of the failure according to the details of the
AppRunAbnormally event.
Clearly describe the cause of the failure and contact the platform administrator to
fix the failure.
If the cause of the failure is unclear, contact ZTE technical support.
Alarm Cause
Action
1. On the “Event” tab of the application details page, filter the searched events by
“AppRunAbnormally”.
If it displays “could not run normally in appointed time” in the event description, go
to Step 2.
If it displays “select cluster fail” in the event description, go to Step 3.
If it displays “resource of tenant is not enough” in the event description, this
indicates the resource quota of the tenant is insufficient. Contact the platform
administrator.
Alarm Cause
In the underlay scenario, the PaaS network components failed to create network ports
because the port resource quota of IaaS is insufficient. The PaaS network lacks the
network specified in the Pod blueprint.
Action
1. Contact the IaaS administrator to modify the resource quota configuration of the
IaaS tenant used by PaaS.
2. Check whether the PaaS network has created the network planned for use in the
Pod blueprint. Open the Portaladmin system -> "Resources" -> "Network" page and
check whether the network is created. If not, click "Create Network" to add a new one.
Alarm Cause
Action
Alarm Cause
Download blueprint failed, Create PVC failed, Create IPGroup failed, Deploy pdm/vnpm
server failed, Deploy broker failed.
Action
1. If the detail information of the alarm is "download BluePrint failed", check "Software
Repository" -> "Blueprint" and, according to the deployed common service name and
version number, check whether the corresponding common service blueprint exists.
No → Please contact the administrator to upload the blueprint version.
Yes → Please contact the administrator to confirm whether the software
repository is normal.
2. If the detail information of the alarm is "create PVC failed", check the shared storage
node. Please contact the administrator to confirm whether the environment has
storage clusters or whether the volume capacity resources are exhausted.
3. If the detail information of the alarm is "NW create ipgroup failed", check the network.
Please contact the administrator to confirm the network.
4. If the detail information of the alarm is "vnpm deploy server failed", check the events
from the VNPM. Check the "Monitor" -> "Alarm" -> "Current Alarm" page to see
whether there is a "cluster status abnormal" alarm.
Yes → Click the alarm name and view "detail information" -> "Suggestion" to handle
the alarm.
No → Go to step 6.
5. If the detail information of the alarm is "vnpm deploy broker failed", check the events
from the VNPM, following the same operation as in step 4.
6. Please contact ZTE technical support.
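For step 2, the PVC state can also be inspected directly when kubectl access to the cluster is available; the namespace here is a placeholder.

```shell
# Step 2: a PVC stuck in Pending usually means no matching storage
# class or exhausted volume capacity. NS is a placeholder namespace.
NS="${NS:-default}"

if command -v kubectl >/dev/null 2>&1; then
    kubectl get pvc -n "$NS"
    # The events explain why binding failed (capacity, storage class, ...)
    kubectl describe pvc -n "$NS" | grep -A 5 "Events"
else
    echo "kubectl not found; run this on a node with cluster access"
fi
```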
Alarm Cause
The master zone failed to send data to the slave zone, probably due to one of the
following factors:
A network failure.
Network congestion.
The peer entity in the slave zone is offline.
Action
Contact ZTE technical support to check the component status and network configurations
for disaster recovery, and repair them.
Alarm code: 9121
Alarm description:
This alarm is generated when the FTP disk space usage is excessive.
As the FTP disk space usage increases, the alarm level is adjusted dynamically.
The default thresholds are as follows:
→ When the usage rate reaches 70%, a major alarm is reported. At this point the FTP
function is not actually affected; the alarm reminds users to clean up the disk
space in time. The alarm is cleared when the usage drops below 60%.
→ When the usage rate reaches 90%, a critical alarm is reported. At this point the
FTP server is set to read-only: files can only be viewed, downloaded, and deleted,
not uploaded. Clean up the FTP disk space immediately. When the usage rate drops
below 80%, the alarm level changes back to major and the FTP server becomes
writable again.
The above thresholds can be customized by users on the FTP administration page.
Alarm level: critical
Alarm type: QoS Alarm
Alarm Cause
Action
Contact ZTE technical support to clean up unused data on the FTP server.
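The documented 70%/90% default thresholds amount to the following check; this is a minimal sketch in which the FTP directory path is a placeholder, and the real thresholds may have been customized on the FTP administration page.

```shell
# Map current disk usage of the FTP directory to the documented default
# alarm levels. FTP_DIR is a placeholder path.
FTP_DIR="${FTP_DIR:-/}"

# POSIX df -P keeps each filesystem on one line; column 5 is "Capacity"
USED=$(df -P "$FTP_DIR" | awk 'NR == 2 { gsub(/%/, ""); print $5 }')
if [ "$USED" -ge 90 ]; then
    echo "critical: ${USED}% used, FTP is set to read-only"
elif [ "$USED" -ge 70 ]; then
    echo "major: ${USED}% used, clean up FTP disk space in time"
else
    echo "ok: ${USED}% used"
fi
```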
Alarm code: 5101
Alarm description: This alarm is reported when a shared volume or local disk attached
to the FTP service cannot be read or written properly.
Alarm level: critical
Alarm type: QoS Alarm
Alarm Cause
Action
Alarm code: 9201
Alarm description: This alarm is generated when project creation fails.
Alarm level: Major
Alarm type: QoS Alarm
Alarm Cause
Action
Alarm code: 9301
Alarm description: This alarm is reported when the connection for geographical disaster
recovery is broken.
Alarm level: critical
Alarm type: QoS Alarm
Alarm Cause
Action
Contact ZTE technical support to check the network configurations for disaster recovery
and repair them.
Alarm Cause
See Details.
Action
Alarm Cause
See Details.
Action