
ECS

Version 3.5

Monitoring Guide
Rev01
May 2020
Copyright © 2019-2020 Dell Inc. or its subsidiaries. All rights reserved.

Dell believes the information in this publication is accurate as of its publication date. The information is subject to change without notice.

THE INFORMATION IN THIS PUBLICATION IS PROVIDED “AS-IS.” DELL MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND
WITH RESPECT TO THE INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED WARRANTIES OF
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. USE, COPYING, AND DISTRIBUTION OF ANY DELL SOFTWARE DESCRIBED
IN THIS PUBLICATION REQUIRES AN APPLICABLE SOFTWARE LICENSE.

Dell Technologies, Dell, EMC, Dell EMC and other trademarks are trademarks of Dell Inc. or its subsidiaries. Other trademarks may be the property
of their respective owners. Published in the USA.

Dell EMC
Hopkinton, Massachusetts 01748-9103
1-508-435-1000 In North America 1-866-464-7381
www.DellEMC.com


CONTENTS

Figures

Tables

Chapter 1 Monitoring Basics
    View the ECS Portal Dashboard
        Upper-right menu bar
        View requests
        View capacity utilization
        View performance
        View storage efficiency
        View geo monitoring
        View node and disk health
        View alerts
        View audits
    Using monitoring pages
        Table navigation
        Filter by date and time
        History
        Export icon

Chapter 2 Monitoring ECS
    Monitor metering data
        Metering data
    Monitor capacity utilization
        Read-only system
        Capacity forecast
        Monitor capacity
        Monitor used capacity
        Monitor garbage collection data
        Monitor erasure encoding
        Monitor CAS processing
    Monitor system health
        Monitor hardware health
        Monitor process health
        Monitor node rebalancing status
    Monitor transactions
    Monitor recovery status
    Monitor disk bandwidth
    Introduction to geo-replication monitoring
        Monitor geo replication: Rate and Chunks
        Monitor geo replication: Recovery Point Objective (RPO)
        Monitor geo replication: Failover Processing
        Monitor geo replication: Bootstrap Processing
    Cloud hosted VDC monitoring
        Cloud topology
        Cloud replication traffic

Chapter 3 Monitoring Events: Audits and Alerts
    About event monitoring
    Monitor audit data
    Audit messages
    Monitor alerts
    Alert policy
        New alert policy
    Acknowledge all alerts
    Alert messages

Chapter 4 Advanced Monitoring
    Advanced Monitoring
        View Advanced Monitoring Dashboards
        Share Advanced Monitoring Dashboards
    Flux API
        Monitoring list of metrics
        Monitoring list of metrics: Non-Performance
        Monitoring list of metrics: Performance
        Flux API replacements for deprecated dashboard API
    Dashboard APIs

Chapter 5 Examining Service Logs
    ECS service logs



FIGURES

1 Upper-right menu bar
2 Refresh icon
3 Open Filter panel with date and time range selections
4 History chart with active cursor
5 Export icons



TABLES

1 Bucket and namespace metering
2 Capacity utilization: VDC
3 Capacity utilization: storage pool
4 Capacity utilization: node
5 Capacity utilization: disk
6 Used capacity
7 Garbage collection: garbage detected
8 Garbage collection: capacity reclaimed
9 Erasure encoding metrics
10 CAS processing metrics
11 VDC, node, and process health metrics
12 ECS processes
13 Rate and Chunks columns
14 RPO columns
15 Failover columns
16 Bootstrap Processing columns
17 Replication traffic by VDC
18 Replication traffic by replication group
19 ECS audit messages
20 Alert types
21 ESRS dial home types
22 ECS Object alert messages
23 ECS fabric alert messages
24 Secure Remote Services alert messages
25 Advanced monitoring dashboards
26 Advanced monitoring dashboard fields
27 APIs removed in ECS 3.5.0



CHAPTER 1
Monitoring Basics

• View the ECS Portal Dashboard
• Using monitoring pages


View the ECS Portal Dashboard


The ECS Portal Dashboard provides critical information about the ECS processes on the VDC you
are currently logged in to.
The Dashboard is the first page you see after you log in. The title of each panel (box) links to the
portal monitoring page that shows more detail for the monitoring area.

Upper-right menu bar


The upper-right menu bar appears on each ECS Portal page.
Figure 1 Upper-right menu bar

Menu items include the following icons and menus:


1. The Alert icon displays a number that indicates how many unacknowledged alerts are pending
for the current VDC. The number displays 99+ if there are more than 99 alerts. You can click
the Alert icon to see the Alert menu, which shows the five most recent alerts for the current
VDC.
2. The Help icon brings up the online documentation for the current portal page.
3. The Guide icon brings up the Getting Started Task Checklist.
4. The VDC menu displays the name of the current VDC. If your AD or LDAP credentials allow you
to access more than one VDC, you can switch the portal view to the other VDCs without
entering your credentials.
5. The User menu displays the current user and allows you to log out. The User menu displays the
last login time for the user.

View requests
The Requests panel displays the total requests, successful requests, and failed requests.
Failed requests are organized by system error and user error. User failures are typically HTTP 400
errors. System failures are typically HTTP 500 errors. Click Requests to see more request
metrics.
Request statistics do not include replication traffic.

View capacity utilization


The Capacity Utilization panel displays the total, used, available, reserved, and percent full
capacity.
Note: When the storage pool reaches 90% of its total capacity, it does not accept write
requests and it becomes a read-only system. A storage pool must have a minimum of four
nodes and must have three or more nodes with more than 10% free capacity in order to allow
writes. This reserved space is required to ensure that ECS does not run out of space while
persisting system metadata. If these criteria are not met, the write fails. The ability of a
storage pool to accept writes does not affect the ability of other pools to accept writes. For
example, if you have a load balancer that detects a failed write, the load balancer can
redirect the write to another VDC.
Capacity amounts are shown in gibibytes (GiB) and tebibytes (TiB). One GiB is approximately equal
to 1.074 gigabytes (GB). One TiB is approximately equal to 1.1 terabytes (TB).
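The approximate conversion factors quoted above follow from the definitions of the binary and decimal units. The following minimal sketch shows the arithmetic; the helper names are illustrative and are not part of ECS.

```python
# Binary (IEC) units, as shown in the ECS Portal, versus decimal (SI) units.
GIB = 2**30   # bytes in one gibibyte
TIB = 2**40   # bytes in one tebibyte
GB = 10**9    # bytes in one gigabyte
TB = 10**12   # bytes in one terabyte


def gib_to_gb(gib: float) -> float:
    """Convert gibibytes to gigabytes."""
    return gib * GIB / GB


def tib_to_tb(tib: float) -> float:
    """Convert tebibytes to terabytes."""
    return tib * TIB / TB


print(round(gib_to_gb(1), 3))  # 1.074
print(round(tib_to_tb(1), 3))  # 1.1
```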
The Used capacity indicates the amount of capacity that is in use. Click Capacity Utilization to
see more capacity metrics.
The capacity metrics are available in the left menu.

View performance
The Performance panel displays how network read and write operations are currently performing,
and the average read/write performance statistics over the last 24 hours for the VDC.
Click Performance to see more comprehensive performance metrics.
Note: An SSD Cache Enabled label is displayed if the feature is enabled on the node. If Read
Cache is disabled, or the nodes do not have SSD disks, the SSD Cache Enabled label is not
displayed.

View storage efficiency


The Storage Efficiency panel displays the efficiency of the erasure coding (EC) process.
The chart shows the progress of the current EC process, and the other values show the total
amount of data that is subject to EC, the amount of EC data waiting for the EC process, and the
current rate of the EC process. Click Storage Efficiency to see more storage efficiency metrics.

View geo monitoring


The Geo Monitoring panel displays how much data from the local VDC is waiting for geo-
replication, and the rate of the replication.
Recovery Point Objective (RPO) refers to the point in time in the past to which you can recover.
The value is the oldest data at risk of being lost if a local VDC fails before replication is complete.
Failover Progress shows the progress of any active failover that is occurring in the federation
involving the local VDC. Bootstrap Progress shows the progress of any active process to add a
new VDC to the federation. Click Geo Monitoring to see more geo-replication metrics.

View node and disk health


The Node & Data Disks panel displays the health status of disks and nodes.
A green check mark beside the node or disk number indicates the number of nodes or disks in good
health. A red x indicates bad health. Click Node & Data Disks to see more hardware health
metrics. If the number of bad disks or nodes is a number other than zero, clicking the count takes
you to the corresponding Hardware Health tab (Offline Data Disks or Offline Nodes) on the
System Health page.
Note: If the data from failed disks has already been recovered and the failed disks are ready
for replacement, they do not show in the Node & Data Disks panel. Click Manage Disks under
System Health to go to Maintenance, which indicates if there are disks that are ready for
physical replacement. Alternatively, access Maintenance using the left panel menu, Manage >
Maintenance.

View alerts
The Alerts panel displays a count of critical alerts and errors.
Click Alerts to see the full list of current alerts. Any Critical or Error alerts are linked to the Alerts
tab on the Events page where only the alerts with a severity of Critical or Error are filtered and
displayed.
Note: Alerts can also be filtered by the Info and Warning severities.

View audits
Audits can be filtered only by date-time range and namespace.

Using monitoring pages


This section introduces the basic techniques for using monitoring pages in the ECS Portal.
The ECS Portal monitoring pages share a set of common interactions as described in the following
sections:

Table navigation
Highlighted text in a table row indicates a link to a detail display. Selecting the link drills down to
the next level of detail. On drill-down displays, a path string shows your current location in the
sequence of drill-down displays. This path string is called a breadcrumb trail or breadcrumbs for
short. Selecting any highlighted breadcrumb jumps up to the associated display.
On some monitoring displays, you can force a table to refresh with the latest data by clicking the
Refresh icon.
Figure 2 Refresh icon

Filter by date and time


The standard monitoring filter enables you to narrow results by date and time. It is available on
several monitoring pages. Some pages have more filter types, described on those pages.
You can select a Date Time Range predefined value (in hours, weeks, or months) or select Custom
to specify a From and To date and time. For the To value, you can select the current time. After
selecting a Date Time Range, click Apply. The Filter panel closes and the page content
updates. When closed, the Filter panel shows a summary of the applied filter settings and provides
a Clear Filter command and a Refresh symbol.
If you want the Filter panel to stay open, click the Pin icon before you click Apply.


Figure 3 Open Filter panel with date and time range selections

When the table has the Current filter applied, the latest values are displayed. When the table has a
date-time range filter applied, it displays the average value over that period.
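The difference between the Current filter and a date-time range filter can be illustrated with a minimal sketch. The sample format and the function name are hypothetical; the sketch only models the stated behavior (latest value versus average over the range) and does not reflect how the portal computes it internally.

```python
from statistics import mean
from typing import List, Optional, Tuple

Sample = Tuple[int, float]  # (timestamp in seconds, metric value); hypothetical format


def displayed_value(
    samples: List[Sample],
    time_range: Optional[Tuple[int, int]] = None,
) -> Optional[float]:
    """Return the value a table cell would show for the given filter.

    time_range=None models the Current filter (latest value).
    time_range=(start, end) models a date-time range filter (average value).
    """
    if not samples:
        return None
    if time_range is None:
        # Current filter: pick the sample with the newest timestamp.
        return max(samples, key=lambda s: s[0])[1]
    start, end = time_range
    in_range = [value for ts, value in samples if start <= ts <= end]
    # No samples inside the range means there is nothing to average.
    return mean(in_range) if in_range else None


samples = [(1, 10.0), (2, 20.0), (3, 60.0)]
print(displayed_value(samples))          # 60.0 (latest)
print(displayed_value(samples, (1, 2)))  # 15.0 (average of 10.0 and 20.0)
```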

History
When you select a History button, all available charts for that row are displayed below the table.
You can hover over a chart from left to right to see a vertical line that helps you find a specific
date-time point on the chart. A pop-up display shows the value and timestamp for that point.
The date-time scale is determined by the filter setting that has been configured. When the
Current filter is selected, the charts show data from the last 24 hours. History data is kept for 60
days.
Figure 4 History chart with active cursor

In the history charts, when the Current filter is selected, if there is no available historical data, No
Data displays.


Export icon
The Export icon enables you to export data from all the monitoring tables and graphs to PDF, DOC,
Excel, and CSV formats for later use. To select the format and export the data, use the Export
icon in the upper right of the menu bar on each table and graph.
The exported data can be used to get a longer term view on capacity usage and consumption
trends.
Figure 5 Export icons



CHAPTER 2
Monitoring ECS

• Monitor metering data
• Monitor capacity utilization
• Monitor system health
• Monitor transactions
• Monitor recovery status
• Monitor disk bandwidth
• Introduction to geo-replication monitoring
• Cloud hosted VDC monitoring


Monitor metering data


You can display metering data for namespaces, or buckets within namespaces, for a specified time
period.
About this task
The available metering data is detailed in the Metering data section.
Using the ECS Management REST API, you can retrieve data programmatically with custom
clients. For details, see the ECS Management REST API Reference.
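As an illustration of programmatic retrieval, the following sketch builds metering (billing) request URLs and shows how a token-authenticated request might be issued with the Python standard library. The port (4443), the /login and /object/billing/... paths, and the X-SDS-AUTH-TOKEN header are assumptions that should be confirmed against the ECS Management REST API Reference for your release; the host name and helper names are hypothetical.

```python
import base64
import json
import urllib.request
from typing import Optional
from urllib.parse import quote


def billing_info_url(host: str, namespace: str,
                     bucket: Optional[str] = None, port: int = 4443) -> str:
    """Build a metering URL for a namespace, or for one bucket in it.

    The paths below are assumed; verify them in the API Reference.
    """
    ns = quote(namespace, safe="")
    base = f"https://{host}:{port}/object/billing"
    if bucket is None:
        return f"{base}/namespace/{ns}/info"
    return f"{base}/buckets/{ns}/{quote(bucket, safe='')}/info"


def login(host: str, user: str, password: str, port: int = 4443) -> str:
    """Request an auth token using HTTP Basic credentials (assumed flow)."""
    creds = base64.b64encode(f"{user}:{password}".encode()).decode()
    req = urllib.request.Request(
        f"https://{host}:{port}/login",
        headers={"Authorization": f"Basic {creds}"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.headers.get("X-SDS-AUTH-TOKEN", "")


def fetch_json(url: str, token: str) -> dict:
    """GET a management API URL and decode the JSON response."""
    req = urllib.request.Request(
        url,
        headers={"X-SDS-AUTH-TOKEN": token, "Accept": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Building the URL needs no network access:
print(billing_info_url("ecs.example.com", "ns1"))
```

A client would typically call login once, then pass the returned token to fetch_json for each metering URL. Certificate validation settings depend on how the ECS management endpoint is configured and are left out of this sketch.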
Procedure
1. In the ECS Portal, select Monitor > Metering.
2. From the Date Time Range menu, select the period for which you want to see the metering
data. Select Current to view the current metering data. Select Custom to specify a custom
date-time range.
Metering is not a real-time reporting activity but is performed as a background process and
some delay in reporting can occur. The longest delay is about 15 minutes. However, where
the system is under heavy load, or is unstable, longer delays can be seen. If you are
encountering longer delays, contact ECS Customer Support.

If you select Custom, use the From and To calendars to choose the time period for which
data will be displayed.

Metering data is kept for 30 days.


Note: The Current filter displays the latest available values. A date-time range filter
displays average values over the specified range.

3. Select the namespace for which you want to display metering data. To narrow the list of
namespaces, type the first few letters of the target namespace and click the magnifying
glass icon.
If you are a Namespace Administrator, you will only be able to select your namespace.
4. Click the + icon next to each namespace you want to see object data for.
5. To see the data for a particular bucket, click the + icon next to each bucket for which you
want to see data.
To narrow the list of buckets, type the first few letters of the target bucket and click the
magnifying glass icon.
If you do not specify a bucket, the object metering data will be the totals for all buckets in
the namespace.
6. Click Apply to display the metering data for the selected namespace and bucket for the
specified time period.
Note: While all buckets in a geo-federation can be selected in metering, if a selected
bucket is not associated with a replication group to which the VDC that you are logged
in to belongs, metering information cannot be retrieved for that bucket. In this case,
after a wait, the bucket is listed as No data. To get the metering information for the
bucket, log in to the VDC that owns the bucket or any VDC that is part of the replication
group to which the bucket belongs.
Depending on the Date Time Range selected, the attributes that are displayed on the
Metering page may change. If the Current option is selected, only the Namespace, Buckets,
Bucket Tags, Total MPU Parts, Total MPU Size, Total Size, Object Count, and Last
Updated attributes are displayed in the table. If Custom or any other time range is
chosen, the Namespace, Buckets, Bucket Tags, Total MPU Parts, Total MPU Size, Total
Size, Object Count, Objects Created, Objects Deleted, Write Traffic, and Read Traffic
attributes are displayed in the table, and the Last Updated attribute is not displayed.

Metering data
Object metering data for a specified namespace, or a specified bucket within a namespace, can be
obtained for a defined time period at the ECS Portal Monitor > Metering page.
The metering information that is provided is shown in the following table:

Table 1 Bucket and namespace metering

Attribute Description

Namespace Namespace selected.

Buckets Bucket selected for which the metering data applies. If blank, the data is for
all buckets in the namespace.

Bucket Tags Lists any name=value bucket tags associated with the bucket.

Total MPU Parts The number of MPU parts that have been created and not used as part of a
complete MPU operation.

Total MPU Size The total disk size occupied by MPU parts that have been created and not
used as part of a complete MPU operation.

Total Size Total size of the objects that are stored in the selected namespace or bucket
at the end time that is specified in the filter. If the size is less than 1 GB, then
the portal displays 0GB.

Object Count Number of objects that are associated with the selected namespace or
bucket at the end time that is specified in the filter.

Last Updated If the Current filter is selected, Last Updated displays the time until which
metering data can be considered consistent. This can help you determine
any delay in reported metering stats. The metering stats may include some
data on the operations that are performed after the last updated time.

Objects Created Number of objects that are created in the selected namespace or bucket in
the time period.

Objects Deleted Number of objects that are deleted from the selected namespace or bucket
in the time period.

Write Traffic Total of incoming object data (writes) for the selected namespace or bucket
during the specified period. Values are displayed in a size unit that is based
on the size of the data.

Read Traffic Total of outgoing object data (reads) for the selected namespace or bucket
during the specified period. Values are displayed in a size unit that is based
on the size of the data.

Note: When you perform an update operation on an object, the metering service reports the
object overwrite under both Objects Created and Objects Deleted. Objects Deleted is
incremented because of the expected overwrite behavior of an object. However, no object is
deleted.


Note: Metering is not a real-time reporting activity but is performed as a background process
and some delay in reporting can occur. The longest delay is about 15 minutes. However, where
the system is under heavy load, or is unstable, longer delays can be seen. If you are
encountering longer delays, contact ECS Customer Support.
Note: When there are many concurrent requests, ECS metering can ignore some requests so
that they do not impact system performance. Hence, the Write Traffic value can show less
than the actual write bandwidth.

Monitor capacity utilization


You can monitor capacity utilization from the ECS Portal Monitor > Capacity Utilization page.
You can monitor the capacity utilization of storage pools, nodes and the entire VDC.
The Capacity Utilization page has the following tabs:
• Capacity: View summary data about the total, used, available, and reserved storage capacity of
storage pools and nodes
• Used Capacity: View data about the used capacity for the VDC and storage pools
• Garbage Collection: View data about garbage detected, recovered capacity, capacity that is
pending reclamation, and capacity that cannot be reclaimed
• Erasure Encoding: View erasure-encoded data in a local storage pool, data that is pending
erasure encoding, and the current erasure encoding rate and estimated completion time
• CAS Processing: View garbage data collection for CAS (Content Addressable Storage)
buckets.
Tables showing capacity usage data display in each of the tabs. You can drill down into the nodes
and to individual disks by selecting the appropriate link in each table. Each row has an associated
History display that enables you to see how the data has changed over time. To graphically display
how capacity has changed over time, select History for the storage pool, node, or disk that you
are interested in. History data is kept for 30 days.
See Using monitoring pages for information about navigating the tables.

Read-only system
When the storage pool reaches 90% of its total capacity, it does not accept write requests and it
becomes a read-only system. A storage pool must have a minimum of four nodes and must have
three or more nodes with more than 10% free capacity in order to allow writes. This reserved
space is required to ensure that ECS does not run out of space while persisting system metadata.
If these criteria are not met, the write fails. The ability of a storage pool to accept writes does not
affect the ability of other pools to accept writes. For example, if you have a load balancer that
detects a failed write, the load balancer can redirect the write to another VDC.
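The write-acceptance conditions described above can be expressed as a simple check. This is a minimal sketch that treats every stated condition as necessary; the Node type and function name are hypothetical, and ECS may evaluate these conditions differently internally.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    total_gib: float  # total capacity of the node
    used_gib: float   # used capacity of the node

    @property
    def free_fraction(self) -> float:
        return (self.total_gib - self.used_gib) / self.total_gib


def pool_accepts_writes(nodes: List[Node]) -> bool:
    """Apply the three stated conditions to a storage pool."""
    # A storage pool must have a minimum of four nodes.
    if len(nodes) < 4:
        return False
    # At 90% of total capacity the pool becomes read-only.
    total = sum(n.total_gib for n in nodes)
    used = sum(n.used_gib for n in nodes)
    if used / total >= 0.90:
        return False
    # Three or more nodes must have more than 10% free capacity.
    nodes_with_headroom = sum(1 for n in nodes if n.free_fraction > 0.10)
    return nodes_with_headroom >= 3


healthy = [Node(100, 50) for _ in range(4)]
skewed = [Node(100, 95), Node(100, 95), Node(100, 50), Node(100, 50)]
print(pool_accepts_writes(healthy))  # True
print(pool_accepts_writes(skewed))   # False: only two nodes have >10% free
```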

Capacity forecast
You can use the Capacity tab to monitor when the capacity is expected to reach 50% and 80%.
The capacity forecast is based on the current usage pattern, shown as 1-day, 7-day, and 30-day
usage trends. Capacity Forecast data is shown for the entire VDC, for an individual
storage pool, or for nodes.
Note: The capacity ETA shown as N/A could be due to the following reasons:
1. There is not enough historical data for a forecast. At least two data points (1 hour apart)
are required. This can happen when the ECS system is first deployed. Click the History
button at the VDC, storage pool, or node level to verify.
2. If capacity has passed the intended target, the ETA is set to 0.
3. The used capacity shows a down trend for the specified time (for example, 7 days). Click
the History button or get the history through dashboard API to verify.
To see the capacity forecast data from the ECS Portal, select Monitor > Capacity Utilization >
Capacity. The Capacity tab is the default.
To see the data about total capacity, used capacity, and available capacity, click History.
Capacity Forecast is calculated based on the total capacity and used capacity.
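The forecast rules in this section can be illustrated with a simple linear extrapolation. The guide does not state the exact forecasting algorithm, so the straight-line trend between the first and last sample is an assumption, as are the function name and the sample format. None stands for an N/A result.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (hours since first sample, used capacity)


def hours_until_target(samples: List[Sample], total: float,
                       target_fraction: float) -> Optional[float]:
    """Estimate hours until used capacity reaches target_fraction of total.

    Samples are assumed to be sorted by time. Returns None for N/A.
    """
    # At least two data points, 1 hour apart, are required.
    if len(samples) < 2:
        return None
    (t0, u0), (t1, u1) = samples[0], samples[-1]
    if t1 - t0 < 1.0:
        return None
    target = total * target_fraction
    # Capacity has already passed the target: the ETA is 0.
    if u1 >= target:
        return 0.0
    # A downward trend has no ETA; a flat trend is treated the same way
    # here to avoid dividing by zero.
    rate = (u1 - u0) / (t1 - t0)
    if rate <= 0:
        return None
    return (target - u1) / rate


one_day = [(0.0, 100.0), (24.0, 124.0)]          # grows by 1 unit per hour
print(hours_until_target(one_day, 1000.0, 0.5))  # 376.0
print(hours_until_target(one_day, 1000.0, 0.1))  # 0.0 (already past 10%)
```

Running the same function over 1-day, 7-day, and 30-day sample windows would give one estimate per trend, analogous to the trend columns shown on the Capacity tab.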

Monitor capacity
You can use the Capacity tab to view capacity utilization data for:
• VDC (see VDC capacity utilization)
• Storage Pools (see Storage pool capacity utilization)
• Nodes (see Node capacity utilization)
• Disks (see Disk capacity utilization)
• Used Capacity (see Monitor used capacity)
You can view summary storage usage data about total, used, available, and reserved storage
capacity for storage pools and nodes.
Reserved capacity is the approximately 10 percent of the total capacity that is reserved for failure
handling and for performing erasure encoding or XOR operations. Reserved capacity is not
available for writing new data.
The tab opens with the Storage Pools capacity table displayed. To view capacity data for individual
nodes, click the appropriate link in the Nodes (Online) column to display the Nodes table. Click
the appropriate link in the Disks (Online) column to view capacity data for individual disks.
You can display average values over a selected date-time range or over a custom time range using
the Filter drop-down menu. The Current filter displays the latest available values and is the
default filter value.
When the table has the Date Time Range filter set to Current (the default setting), the table
displays the latest values and the history graphs display values over the last 24-hour period. When
the table has a Date Time Range filter applied (other than Current), it displays the average value
over that period.
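
The difference between the Current filter and a date-time range filter can be sketched as follows. This is only an illustration of the behavior described above; the sample layout and function name are hypothetical, and how ECS samples and averages values internally is not documented here.

```python
from datetime import datetime
from statistics import mean
from typing import Optional, Sequence, Tuple

# Hypothetical sample type: (timestamp, metric value).
Sample = Tuple[datetime, float]


def displayed_value(samples: Sequence[Sample],
                    start: Optional[datetime] = None,
                    end: Optional[datetime] = None) -> Optional[float]:
    """Value a table cell would show for the given filter."""
    if start is None and end is None:
        # Current filter: latest available value.
        return max(samples, key=lambda s: s[0])[1]
    # Date-time range filter: average over the selected period.
    in_range = [value for ts, value in samples
                if (start is None or ts >= start) and (end is None or ts <= end)]
    return mean(in_range) if in_range else None


samples = [(datetime(2020, 5, 1), 10.0),
           (datetime(2020, 5, 2), 20.0),
           (datetime(2020, 5, 3), 60.0)]
print(displayed_value(samples))
print(displayed_value(samples, datetime(2020, 5, 1), datetime(2020, 5, 2)))
```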
VDC capacity utilization

Table 2 Capacity utilization: VDC

Attribute             Description
VDC                   Name of the VDC.
Per 1 Day Trend 50%   Forecast of when VDC capacity is expected to reach 50%, based on the 1-day usage trend.
Per 7 Day Trend 50%   Forecast of when VDC capacity is expected to reach 50%, based on the 7-day usage trend.
Per 30 Day Trend 50%  Forecast of when VDC capacity is expected to reach 50%, based on the 30-day usage trend.
Per 1 Day Trend 80%   Forecast of when VDC capacity is expected to reach 80%, based on the 1-day usage trend.
Per 7 Day Trend 80%   Forecast of when VDC capacity is expected to reach 80%, based on the 7-day usage trend.
Per 30 Day Trend 80%  Forecast of when VDC capacity is expected to reach 80%, based on the 30-day usage trend.
Total                 Total capacity of the VDC that is online. This is the sum of the capacity that is already used and the capacity still free for allocation.
Used                  Used online capacity in the VDC.
Available (Reserved)  Online capacity available for use, including the approximately 10% of the total capacity that is reserved for failure handling and for performing erasure encoding or XOR operations. Note: If the Current filter is applied, Available (Reserved) displays. If a filter other than Current is applied, only Available displays.
Actions               History provides a graphic display of the data. If the Current filter (default) is selected, the History button displays total, used, and available capacity for the last 24 hours. History data is kept for 60 days.

Storage pool capacity utilization

Table 3 Capacity utilization: storage pool

Attribute                                Description
Storage Pool                             Name of the storage pool.
Nodes (Online)                           Number of nodes in the storage pool followed by the number of those nodes online. Click this number to open the Node capacity utilization table.
Online Nodes with Sufficient Disk Space  Number of online nodes that have sufficient disk space to accept new data. If too many disks are too full to accept new data, the performance of the system may be impacted. Note: Does not appear if a filter other than Current is applied.
Disks (Online)                           Number of disks in the storage pool followed by the number of those disks that are online.
Total                                    Total capacity of the storage pool that is online. This is the sum of the capacity that is already used and the capacity still free for allocation.
Used                                     Used online capacity in the storage pool.
Available (Reserved)                     Online capacity available for use, including the approximately 10% of the total capacity that is reserved for failure handling and for performing erasure encoding or XOR operations. Note: If the Current filter is applied, Available (Reserved) displays. If a filter other than Current is applied, only Available displays.
Actions                                  History provides a graphic display of the data. If the Current filter (default) is selected, the History button displays total, used, and available capacity for the last 24 hours. History data is kept for 60 days.

Node capacity utilization

Table 4 Capacity utilization: node

Attribute             Description
Nodes                 Fully qualified domain name (FQDN) of the node.
Disks (Online)        Number of disks that are associated with the node followed by the number of those disks that are online. Click the disk number to open the Disk capacity utilization table.
Total                 Total online capacity provided by the online disks within the node. This is the sum of the capacity that is already used and the capacity still free for allocation.
Used                  Online capacity used within the node.
Available (Reserved)  Remaining online capacity available in the node, including reserved capacity. Note: If the Current filter is applied, Available (Reserved) displays. If a filter other than Current is applied, only Available displays.
Offline               Total capacity of the node that is offline. Note: Displays only if the Current filter is applied.
Online Status         Indicates whether the node is online or offline. A check mark indicates that the node status is Good.
Actions               History provides a graphic display of the data. If the Current filter (default) is selected, the History button displays total, used, and available capacity for the last 24 hours. History data is kept for 60 days.

Disk capacity utilization

Table 5 Capacity utilization: disk

Attribute      Description
Disks          Disk identifier.
Total          Total capacity provided by the disk.
Used           Capacity used on the disk.
Available      Remaining capacity available on the disk.
Online Status  Indicates whether the disk is online or offline. A check mark indicates that the disk status is Good.
Actions        History provides a graphic display of the data. If the Current filter (default) is selected, the History button displays total, used, and available capacity for the last 24 hours. History data is kept for 60 days.

Monitor used capacity


You can use the Used Capacity tab to view the used storage capacity for the current VDC and for
each storage pool in the VDC.

Table 6 Used capacity

Storage use          Description
User Data            The capacity that is used for the repository chunks representing data uploaded by ECS users.
System Metadata      The capacity that is used by the ECS processes that track and describe the data in the system.
Protection Overhead  The combined overhead of triple mirroring and erasure coding for all user data, system metadata, and geo data protection chunks protected locally.
Geo Cache            The capacity used to cache chunks that are accessed locally but not stored locally.
Geo Copy             The capacity that is used for geo-replication chunks stored on the current VDC.
Garbage              The capacity used by data that is no longer in use.

Storage usage is shown as color-coded bars, one color for the current VDC, and a different color
for its storage pools. Tool tips for each colored bar correspond to the status information in the
numeric status line.

Monitor garbage collection data


You can use the Garbage Collection tab to monitor garbage collection data for the entire VDC or
for individual storage pools. Use the Virtual Data Center drop-down menu to select the storage
type: Virtual Data Center or Storage Pool. Virtual Data Center is the default.
Garbage collection is enabled by default at installation. Contact your customer support
representative to disable or reenable this feature.
The Garbage Collection page has the following subtabs:
• Garbage Detected: View summary garbage collection data.
• Capacity Reclaimed: View data about storage capacity reclaimed by the garbage collection
process.
Garbage Detected
Click the Virtual Data Center drop-down menu to view garbage detection data for the entire VDC
or individual storage pools.


Table 7 Garbage collection: garbage detected

Attribute                     Description
Storage Type                  The VDC or storage pool for which to view garbage collection data.
Total Garbage Detected        The amount of reclaimable storage capacity detected on the system.
Capacity Reclaimed            The amount of storage capacity reclaimed by the garbage collection process.
Capacity Pending Reclamation  The amount of storage capacity that is identified as reclaimable but not reclaimed yet.
UnReclaimable Garbage         The amount of storage capacity that cannot be reclaimed currently.

Capacity Reclaimed
Click the Filter button to set a filter for the reclamation data by VDC or storage pool over a date/
time range.

Table 8 Garbage collection: capacity reclaimed

Attribute                  Description
Storage Type               The VDC or storage pool for which to view capacity reclaimed data.
Capacity Reclaimed         The amount of storage capacity recovered following garbage collection.
User Data Reclaimed        The amount of user data recovered.
System Metadata Reclaimed  The amount of system metadata recovered.
Actions                    History provides a graphic display of the data. If the Current filter (default) is selected, the History button displays the total reclaimed capacity for the last 24 hours. History data is kept for 60 days.

Monitor erasure encoding


You can use the Erasure Encoding tab to monitor the total user data and erasure encoded data in
a local storage pool. It also shows the current encoding rate and the estimated completion time.
You can display average values over a selected date-time range or over a custom time range using
the Filter drop-down menu. The Current filter displays the latest available values and is the
default filter value.

Table 9 Erasure encoding metrics

Column                 Description
Storage Pool           The storage pools from the current VDC.
Total Coding Data      The total logical size of all data chunks in the storage pool that are subject to erasure encoding.
Total Coded Data       The total logical size of all erasure-encoded chunks in the storage pool.
Coded (%)              The percent of data in the storage pool that is erasure encoded. Percent values display with three decimal places in the history chart for accurate plotting. Percent values display with two decimal places in the table, consistent with the format of the other values in the table.
Coding Rate            The rate at which any current data waiting for erasure encoding is being processed.
Est. Time to Complete  The estimated completion time extrapolated from the current erasure encoding rate.
Actions                History provides a graphic display of the total coding data, total coded data, percent of data coded, and coding rate per second. If the Current filter is selected, History displays the history for the last 24 hours. History data is kept for 60 days.
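
The relationship between these columns can be sketched as below. The formulas are assumptions inferred from the column descriptions (Coded (%) as coded data over coding data, and the estimate as remaining data divided by the current rate); the function names and byte units are hypothetical.

```python
from typing import Optional


def coded_percent(total_coding_bytes: float, total_coded_bytes: float) -> float:
    """Percent of data subject to erasure encoding that is already encoded."""
    if total_coding_bytes <= 0:
        return 0.0
    return 100.0 * total_coded_bytes / total_coding_bytes


def est_seconds_to_complete(total_coding_bytes: float, total_coded_bytes: float,
                            rate_bytes_per_sec: float) -> Optional[float]:
    """Extrapolate the completion time from the current coding rate."""
    remaining = max(total_coding_bytes - total_coded_bytes, 0.0)
    if remaining == 0:
        return 0.0
    if rate_bytes_per_sec <= 0:
        return None  # nothing is being encoded, so no estimate is possible
    return remaining / rate_bytes_per_sec


pct = coded_percent(800.0, 600.0)
print(f"table: {pct:.2f}  chart: {pct:.3f}")  # two vs. three decimal places
print(est_seconds_to_complete(800.0, 600.0, 50.0))
```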

Monitor CAS processing


You can use the CAS Processing tab to monitor unused CAS (Content Addressable Storage)
objects in CAS buckets within a selected namespace over a specified time range. The unused CAS
objects that are monitored by ECS include unreferenced blobs and expired reflections.
In Centera terminology, there are three types of CAS objects: blob, clip, and reflection.
• Blob: CAS data objects are called blobs (binary large objects). Blobs store data and can be
referenced by data objects of a different type called clips. A blob is referenced by its Content
Address (CA), which is stored in the Content Description File (CDF) that references the blob.
The logical combination of a CDF and a blob is called a clip. The hash of a CDF is the Clip-ID.
There can be multiple clips for the same blob with different CDFs (different metadata but the
same user data, that is, single-instance storage). When blobs are not referenced by live clips,
these unreferenced blobs become garbage data.
• C-Clip: The combination of a CDF and its related blobs.
• Reflection: The CDF of a deleted C-Clip. A reflection is created after the deletion of a C-Clip and
provides an audit trail for each deleted C-Clip. Reflections may have expiration times. (If there
is no configured expiration time for a reflection, the reflection is never deleted.)
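
The object relationships above can be modeled with a minimal sketch that shows how unreferenced blobs and expired reflections might be identified. The class and function names are hypothetical and chosen for illustration; this is not how ECS or Centera implements CAS garbage detection.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import FrozenSet, Iterable, List, Optional, Set


@dataclass(frozen=True)
class Clip:
    clip_id: str                     # hash of the CDF
    blob_addresses: FrozenSet[str]   # content addresses listed in the CDF


@dataclass(frozen=True)
class Reflection:
    clip_id: str                     # Clip-ID of the deleted C-Clip
    expires_at: Optional[datetime]   # None means the reflection never expires


def unreferenced_blobs(all_blobs: Set[str], live_clips: Iterable[Clip]) -> Set[str]:
    """Blobs that no live clip references; these are garbage data."""
    referenced: Set[str] = set()
    for clip in live_clips:
        referenced |= clip.blob_addresses
    return all_blobs - referenced


def expired_reflections(reflections: Iterable[Reflection],
                        now: datetime) -> List[Reflection]:
    """Reflections whose configured expiration time has passed."""
    return [r for r in reflections
            if r.expires_at is not None and r.expires_at <= now]


clips = [Clip("clip-a", frozenset({"ca1"})),
         Clip("clip-b", frozenset({"ca1", "ca2"}))]  # two clips share blob ca1
print(unreferenced_blobs({"ca1", "ca2", "ca3"}, clips))
```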
Click the Filter drop-down menu to select a namespace containing CAS buckets and to set a date/
time range to view the number and size of unreferenced blobs and expired reflections in CAS
buckets.
Important: For ECS systems with existing CAS data that upgrade to 3.2.1, a CAS garbage
data bootstrap process is triggered automatically after the upgrade. The bootstrap process builds
the necessary references over the existing CAS data and can require a significant amount of time,
depending on the amount of existing CAS data. During the bootstrap process, the unreferenced
blob and reflection values do not change on the CAS Processing page. For example, you see zero
for the Unreferenced Blob Data Detected and Unreferenced Blobs Detected values. The
values do not change until the bootstrap process is complete. If you see that the values do
not change over an extended period, call customer support.
When you search for a namespace (using the Search... option at the bottom of the list of
namespaces in the Namespace drop-down field), the search functionality is based on prefixes
only. For example, a search for fin returns finance-namespace-dev, while a search for dev
would return nothing.
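
The prefix-only behavior can be illustrated with a short sketch. The function name and the second namespace name are hypothetical, and case-sensitive matching is an assumption, since the guide does not say how case is handled.

```python
from typing import Iterable, List


def search_namespaces(names: Iterable[str], query: str) -> List[str]:
    """Return namespaces whose names start with the query (prefix match only)."""
    return [name for name in names if name.startswith(query)]


namespaces = ["finance-namespace-dev", "hr-namespace-prod"]
print(search_namespaces(namespaces, "fin"))  # matches the finance namespace
print(search_namespaces(namespaces, "dev"))  # "dev" is not a prefix of any name
```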


Table 10 CAS processing metrics

Attribute                        Description
Bucket                           The name of the bucket containing CAS data.
Unreferenced Blob Data Detected  The amount of unreferenced blob data in the bucket (in bytes).
Unreferenced Blobs Detected      The number of unreferenced blobs in the bucket.
Reflection Data Detected         The amount of reflection data in the bucket (in bytes).
Reflections Expired              The number of expired reflections in the bucket.
Actions                          History provides a graphic display of the unreferenced blob and reflection data. If the Current filter (default) is selected, the History button displays the data for the last 24 hours. History data is kept for 60 days.

Monitor system health


You can monitor system health from the ECS Portal Monitor > System Health page.
The System Health page has the following tabs:
• Hardware Health: View data about the status of nodes and disks.
• Process Health: View data about the status of the NIC, CPU, and memory.
• Node Rebalancing: View data about the status of node rebalancing operations.

Monitor hardware health


You can use the Hardware Health tab to obtain the health of disks and nodes.
About this task
The Hardware Health tab is accessed from the ECS Portal at Monitor > System Health >
Hardware Health. The following states describe hardware health:
• Good: The node is in normal operating condition.
• Suspect: The node is transitioning from good to bad because of decreasing hardware
metrics, there is a problem with a lower-level hardware component, or the hardware is not
detectable by the system because of connectivity problems.
• Bad: The node needs replacement.
Disk states have the following meanings:
• Good: The system is reading from and writing to the disk.
• Suspect: The system no longer writes to the disk but still reads from it. Large groups of suspect
disks are likely caused by connectivity problems at a node. These disks transition back to Good
when the connectivity issues clear up.
• Bad: The system neither reads from nor writes to the disk. Replace the disk. Once the ECS system
has identified a disk as bad, the disk cannot be reused anywhere in the ECS system.
Because of ECS data protection, when a disk fails, copies of the data that was on the disk
are re-created on other disks in the system. A bad disk represents only a loss of capacity to the
system, not a loss of data. When the disk is replaced, no data is restored to the new disk.
It becomes raw capacity for the system.
• Missing: The disk is a known disk that is unreachable. The disk may be transitioning between
states, disconnected, or pulled.
• Removed: The system has completed recovery on the disk and removed it from the
storage engine's list of valid disks. The history of all removed disks is displayed in the ECS UI.
• Not Accessible: If a node is not accessible, all its disks have this status. It indicates that
the actual status of the disk is not available to ECS.
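
The read and write behavior of the disk states can be summarized in a small sketch. The enum and helper names are hypothetical. The guide states the I/O behavior only for Good, Suspect, and Bad; treating Missing, Removed, and Not Accessible as neither readable nor writable is an assumption.

```python
from enum import Enum


class DiskState(Enum):
    GOOD = "Good"
    SUSPECT = "Suspect"
    BAD = "Bad"
    MISSING = "Missing"
    REMOVED = "Removed"
    NOT_ACCESSIBLE = "Not Accessible"


# Good disks are read and written; Suspect disks are read only.
READABLE = {DiskState.GOOD, DiskState.SUSPECT}
WRITABLE = {DiskState.GOOD}


def can_read(state: DiskState) -> bool:
    return state in READABLE


def can_write(state: DiskState) -> bool:
    return state in WRITABLE


def needs_replacement(state: DiskState) -> bool:
    # Only Bad explicitly calls for replacing the disk.
    return state is DiskState.BAD


for state in DiskState:
    print(state.value, can_read(state), can_write(state), needs_replacement(state))
```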
Note: The Current filter displays the latest available values. A date-time range filter displays
average values over the specified range. Value data is kept for 60 days.
Procedure
1. Select Monitor > System Health and select the Hardware Health tab.
By default the Offline Nodes subtab displays. This table may be empty if all nodes are
online. Similarly, the Offline Data Disks subtab may be empty if all disks are online.
2. Select the Offline Nodes and Offline Data Disks subtabs to view a summary.
3. Select the All Nodes and Data Disks subtab to drill down to nodes and disks.
4. Click the node name to drill down to its disk health page.
Note: The Slot Info value always matches the physical slot ID in ECS U-Series, C-
Series, and D-Series Appliances. This makes Slot Info useful for quickly locating a disk
during disk replacement service. Some Certified Hardware installations with ECS
Software may not report useful or reliable data for Slot Info.
Note: Monitor the health of online and offline storage pool nodes and data disks. All data
disks that belong to the selected node are listed here. SSD Read Caches are not
included.

Monitor process health


You can use the Process Health tab to obtain metrics that can help assess the health of the VDC,
node, or node process.
About this task
The Process Health tab is accessed from the ECS Portal at Monitor > System Health > Process
Health.
Note: When you click Process Health, the Process Health - Overview dashboard opens in a
new Grafana window.
Process Health dashboards can also be accessed from Advanced Monitoring by expanding Data
Access Performance - Overview and selecting one of the following:
• Process Health - by Nodes
• Process Health - Overview
• Process Health - Process List by Node

Table 11 VDC, node, and process health metrics

Metric label         Level         Description
Avg. NIC Bandwidth   VDC and Node  Average bandwidth of the network interface controller hardware that is used by the selected VDC or node.
Avg. CPU Usage (%)   VDC and Node  Average percentage of the CPU hardware that is used by the selected VDC or node.
Avg. Memory Usage    VDC and Node  Average usage of the aggregate memory available to the VDC or node.
Relative NIC (%)     VDC and Node  Percentage of the available bandwidth of the network interface controller hardware that is used by the selected VDC or node.
Relative Memory (%)  VDC and Node  Percentage of the memory used relative to the memory available to the selected VDC or node.
CPU Usage            Process       Percentage of the node's CPU used by the process. The list of processes that are tracked is not the complete list of processes running on the node. The sum of the CPU used by the processes is not equal to the CPU usage shown for the node.
Memory Usage         Process       The memory used by the process.
Relative Memory (%)  Process       Percentage of the memory used relative to the memory available to the process.
Avg. # Thread        Process       Average number of threads used by the process.
Last Restart         Process       The last time the process restarted on the node.
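
Two of the relationships in this table can be shown with simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, and attributing the gap between node CPU and summed process CPU to untracked processes is an assumption based on the CPU Usage description.

```python
from typing import Iterable


def relative_memory_pct(used_bytes: float, available_bytes: float) -> float:
    """Memory used as a percentage of the memory available."""
    if available_bytes <= 0:
        return 0.0
    return 100.0 * used_bytes / available_bytes


def untracked_cpu_pct(node_cpu_pct: float, process_cpu_pcts: Iterable[float]) -> float:
    """Node CPU usage not accounted for by the tracked processes."""
    return node_cpu_pct - sum(process_cpu_pcts)


print(relative_memory_pct(8 * 2**30, 32 * 2**30))
print(untracked_cpu_pct(60.0, [20.0, 15.5]))
```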

Table 12 ECS processes

Process                                 Description
Blob Service (blobsvc)                  Manages the following tables: Object (OB), Listing (LS), and Repo Chunk Reference (RR).
Chunk Manager (cm)                      Manages the following tables: Chunk (CT), Btree Reference (BR). Provides the logic to handle various events based on the chunk's current state and decide which state to transition to next.
Directory Table Query (dtquery)         Provides REST APIs to get Directory Table (DT) details.
GeoReceiver (georeceiver)               Receives requests for chunks in the current VDC that are not owned by the current VDC (secondary chunks). It then requests Chunk Manager to start an operation to track the copy chunk creation and select three replicas. The GeoReceiver process then writes the datastream to the three instances. On successful completion, it directs Chunk Manager to commit the copy chunk.
Head Service (headsvc)                  Manages object head protocols: S3, OpenStack Swift, EMC Atmos, CAS, and HDFS.
Metering (metering)                     Manages the following tables: Metering Aggregate (MA) and Metering Raw (MR).
Object Control Service (objcontrolsvc)  Provides REST APIs for configuring the ECS cluster, managing ECS resources, and monitoring the system.
Provision Service (provisionsvc)        Manages the provisioning of storage resources and user access. It handles user management, authorization, and authentication for all provisioning requests, resource management, and multi-tenancy support.
Resource Service (resourcesvc)          Manages the Resource Table (RT), which handles replication groups, buckets, users, namespace information, and so on.
Record Manager (rm)                     Manages the Partition Record (PR) table (journal region).
Storage Service Manager (ssm)           Manages the Storage Space (SS) table, which contains disk block usage and disk-to-chunk mapping. Interacts with one or more Storage Servers and manages the active/free chunks on the corresponding servers. Directs I/O operations to the disks.
Statistics Service (statsvc)            Tracks various information on storage processes. These statistics can be used to monitor the system.
VNest (vnest)                           Provides distributed synchronization and group services. A subset of data nodes are group members responsible for serving the key/value requests. VNest services running on other nodes listen for configuration updates and are ready to be added to the group.

See Advanced Monitoring, Process Health - by Nodes, Process Health - Overview, and Process
Health - Process List by Node for details.

Monitor node rebalancing status


Use the Node Rebalancing tab to monitor the status of data rebalancing operations when nodes
are added to, or removed from, a cluster. Node rebalancing is enabled by default at installation.
Contact your customer support representative to disable or re-enable this feature.
Before you begin
Access the Node Rebalancing tab from the ECS Portal at Monitor > System Health > Node
Rebalancing.
Note: When you click Node Rebalancing, the Node Rebalancing dashboard opens in a new
Grafana window.
The Node Rebalancing dashboard can also be accessed from Advanced Monitoring by expanding
Data Access Performance - Overview and selecting Node Rebalancing.
See Advanced Monitoring and Node Rebalancing for details.

Monitor transactions
You can monitor requests and network performance for VDCs and nodes from the Monitor >
Transactions page.
Access the Transactions tab from the ECS Portal at Monitor > Transactions.
Note: When you click Transactions, the Data Access Performance - Overview dashboard
opens in a new Grafana window.
The Transactions data can also be accessed from Advanced Monitoring > Data Access
Performance - Overview.
See Advanced Monitoring and Data Access Performance - Overview for details.


Monitor recovery status


You can use the Recovery Status page to monitor the data recovered by the system.
About this task
Recovery is the process of rebuilding data after any local condition that results in bad data
(chunks). The Recovery Status page is accessed from the ECS Portal at Monitor > Recovery
Status.
Note: When you click Recovery Status, the Recovery Status dashboard opens in a new
Grafana window.
The Recovery Status dashboard can also be accessed from Advanced Monitoring by expanding
Data Access Performance - Overview and selecting Recovery Status.
See Advanced Monitoring for details.

Monitor disk bandwidth


You can use the Disk Bandwidth page to monitor the disk usage metrics at the VDC or individual
node level.
About this task
The Disk Bandwidth page is accessed from the ECS Portal at Monitor > Disk Bandwidth.
Note: When you click Disk Bandwidth, the Disk Bandwidth - Overview dashboard opens in a
new Grafana window.
Disk Bandwidth dashboards can also be accessed from Advanced Monitoring by expanding Data
Access Performance - Overview and selecting one of the following:
• Disk Bandwidth - by Nodes
• Disk Bandwidth - Overview
See Advanced Monitoring, Disk Bandwidth - by Nodes, and Disk Bandwidth - Overview for details.

Introduction to geo-replication monitoring


You can use the Geo Replication page to monitor the replication of data across the VDCs that
make up a replication group.
The Geo Replication page is accessed from the ECS Portal at Monitor > Geo Replication and
provides four tabs:
• Rate and Chunks
• Recovery Point Objective (RPO)
• Failover Processing
• Bootstrap Processing

Monitor geo replication: Rate and Chunks


You can use the Rate and Chunks tab to obtain metrics about the network traffic for geo-
replication and the chunks waiting for replication by a replication group or remote VDC.
The Rate and Chunks tab is accessed from the ECS Portal at Monitor > Geo Replication > Rate
and Chunks.


Table 13 Rate and Chunks columns

Column                         Description
Replication Group              Lists the replication groups of which this VDC is a member. Click a replication group to see a table of remote VDCs in the replication group and their statistics. Click the Replication Groups link above the table to return to the default view.
Write Traffic                  The current rate of writes to all remote VDCs or to an individual remote VDC in the replication group.
Read Traffic                   The current rate of reads to all remote VDCs or to an individual remote VDC in the replication group.
User Data Pending Replication  The total logical size of user data waiting for replication for the replication group or remote VDC.
Metadata Pending Replication   The total logical size of metadata waiting for replication for the replication group or remote VDC.
Data Pending XOR               The total logical size of all data waiting to be processed by the XOR compression algorithm in the local VDC for the replication group or remote VDC.

Monitor geo replication: Recovery Point Objective (RPO)


You can use the RPO tab to view the recovery point objective for a replication group and its
remote VDCs. The RPO refers to the point in time in the past to which you can recover. The value
presented is the oldest data at risk of being lost if a local VDC fails before replication is complete.
The RPO tab is accessed from the ECS Portal at Monitor > Geo Replication > RPO.

Table 14 RPO columns

Column                               Description
Remote Replication Group\Remote VDC  At the VDC level, lists all remote replication groups of which the local VDC is a member. At the replication group level, this column lists the remote VDCs in the replication group.
Overall RPO                          The recent time period for which data might be lost in the event of a local VDC failure.
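
One way to read the Overall RPO value is as the age of the oldest data that has not yet been replicated. The sketch below illustrates that reading; the function name and inputs are hypothetical, and the way ECS actually derives the value is not described in this guide.

```python
from datetime import datetime, timedelta
from typing import Iterable


def overall_rpo(now: datetime, pending_write_times: Iterable[datetime]) -> timedelta:
    """Age of the oldest write still waiting for replication.

    pending_write_times holds the write times of data not yet replicated
    to the remote VDC (a hypothetical input for this sketch).
    """
    pending = list(pending_write_times)
    if not pending:
        return timedelta(0)  # everything is replicated; nothing is at risk
    return now - min(pending)


now = datetime(2020, 5, 1, 12, 0)
pending = [datetime(2020, 5, 1, 11, 58), datetime(2020, 5, 1, 11, 45)]
print(overall_rpo(now, pending))
```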

Monitor geo replication: Failover Processing


You can use the Failover Processing tab to view metrics on the process that re-replicates data
following the permanent failure of a remote VDC.
The Failover Processing tab is accessed from the ECS Portal at Monitor > Geo Replication >
Failover Processing.

Table 15 Failover columns

Field                             Description
Replication Group                 Lists the replication groups that the local VDC is a member of.
Failed VDC                        Identifies a failed VDC that is part of the replication group.
User Data Pending Re-replication  When a VDC fails, user data chunks replicated to the failed VDC have to be re-replicated to a different VDC. This field reports the logical size of all user data (repository) chunks awaiting re-replication to a different VDC.
Metadata Pending Re-replication   When a VDC fails, metadata chunks replicated to the failed VDC have to be re-replicated to a different VDC. This field reports the logical size of all metadata chunks awaiting re-replication to a different VDC.
Data Pending XOR Decoding         Shows the count and total logical size of chunks waiting to be retrieved by the XOR compression scheme.
Failover State                    One of the following states:
                                  • BLIND_REPLAY_DONE
                                  • REPLICATION_CHECK_DONE: The process that makes sure that all replication chunks are in an acceptable state has completed successfully.
                                  • CONSISTENCY_CHECK_DONE: The process that makes sure that all system metadata is fully consistent with other replicated data has completed successfully.
                                  • ZONE_SYNC_DONE: The synchronization of the failed VDC has completed successfully.
                                  • ZONE_BOOTSTRAP_DONE: The bootstrap process on the failed VDC has completed successfully.
                                  • ZONE_FAILOVER_DONE: The failover process has completed successfully.
Failover Progress                 A percentage indicator for the overall status of the failover process.
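
A monitoring script might track the Failover State values as an ordered list, as in the sketch below. Treating the order in which the table lists the states as the order of progression is an assumption, and the helper names are hypothetical.

```python
from typing import List

# States in the order the table lists them. Reading this as the order of
# progression is an assumption, not something the guide states.
FAILOVER_STATES: List[str] = [
    "BLIND_REPLAY_DONE",
    "REPLICATION_CHECK_DONE",
    "CONSISTENCY_CHECK_DONE",
    "ZONE_SYNC_DONE",
    "ZONE_BOOTSTRAP_DONE",
    "ZONE_FAILOVER_DONE",
]


def failover_complete(state: str) -> bool:
    """True once the final state in the list has been reached."""
    return state == FAILOVER_STATES[-1]


def states_after(state: str) -> List[str]:
    """States listed after the given one; raises ValueError if unknown."""
    return FAILOVER_STATES[FAILOVER_STATES.index(state) + 1:]


print(states_after("ZONE_SYNC_DONE"))
print(failover_complete("ZONE_FAILOVER_DONE"))
```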

Monitor geo replication: Bootstrap Processing


You can use the Bootstrap Processing tab to monitor the copying of user data and metadata to a
VDC that has been added to a replication group.
The Bootstrap Processing tab is accessed from the ECS Portal at Monitor > Geo Replication >
Bootstrap Processing.

Table 16 Bootstrap Processing columns

Column                         Description
Replication Group              This column provides the list of replication groups of which the local VDC is a member and that are adding new VDCs. Each row provides metrics for the specified replication group.
Added VDC                      The VDC being added to the specified replication group.
User Data Pending Replication  The logical size of all user data (repository) chunks waiting for replication to the new VDC.
Metadata Pending Replication   The logical size of all system metadata waiting for replication to the new VDC.
Bootstrap State                The bootstrap state. Can be:
                               • BTreeScan
                               • ReplicateBTree
                               • ReplicateBTreeMarker
                               • ReplicateJournal
                               • Done
Bootstrap Progress (%)         The completion percent of the entire bootstrap process.

Cloud hosted VDC monitoring


ECS can identify whether a site is hosted or on-premise, and the ECS Management REST API
supports retrieving information about the utilization and performance of hosted sites.
Where an ECS system includes a hosted site, the ECS Portal displays a top-level Cloud menu that
enables administrators to see how the hosted sites are used as part of replication groups and to
view the traffic to and from the hosted site in terms of bandwidth utilization and latency. The
portal also shows the traffic to and from on-premise sites to allow comparison with hosted-site
traffic.
The Cloud menu is not shown if the ECS system uses only on-premise sites.

Cloud topology
You can use the Cloud topology summary information to see how the ECS system is making use of
hosted VDCs.
The Cloud > Topology page shows the hosted VDCs that are part of an ECS federated system,
and shows the relationship between the hosted VDC and any on-premise VDCs.
Cloud Hosted VDCs
The Cloud Hosted VDCs table shows the hosted VDCs that are present in the ECS system.
Currently ECS supports a single hosted site.
Related On-Premise VDCs
The Related On-Premise VDCs table shows the on-premise VDCs that are part of the ECS
federation.
Related Replication Groups
The Related Replication Groups table shows the replication groups that contain a storage pool
contributed by a selected hosted VDC. The Hosted VDC is selected in the Cloud Hosted VDC table.
A primary use case for using a hosted VDC is the Passive configuration in which the hosted VDC
provides a site for replication data but cannot be used as an active site by users. However, where
the active operation of the hosted VDC is allowed, the hosted VDC can be included in replication
groups where the type is Passive.
The table shows the replication group type and the VDC storage pools that are contributing to the
replication group, at least one of which will be a hosted VDC.


Cloud replication traffic


You can use the cloud replication traffic information to see the performance of hosted VDCs and
compare it with that of on-premise VDCs.
The Cloud > Replication page shows replication traffic by VDC and by replication group.
Note: The Current filter displays the latest available values. A date-time range filter displays
average values over the specified range.
Virtual Data Centers
The Virtual Data Centers tab shows each VDC, whether hosted or on-premise, and provides
aggregated traffic figures for all replication groups associated with a VDC.

Table 17 Replication traffic by VDC

Attribute Description

Read Latency The average latency in milliseconds for reads from all replication groups
associated with the selected VDC.

Write Latency The average latency in milliseconds for writes to all replication groups
associated with the selected VDC.

Read Bandwidth The bandwidth utilized by reads from all replication groups associated with
the selected VDC.

Write Bandwidth The bandwidth utilized by writes to all replication groups associated with
the selected VDC.

Replication Groups
The Replication Groups tab shows each replication group and provides traffic data for a VDC for
each replication group that it contributes to. A VDC might have a storage pool that is in more than
one replication group, and this display allows you to see the traffic associated with each replication
group.

Table 18 Replication traffic by replication group

Attribute Description

Read Latency The average latency in milliseconds for reads from the selected VDC that
relate to the specified replication group.

Write Latency The average latency in milliseconds for writes to the selected VDC that
relate to the specified replication group.

Read Bandwidth The bandwidth utilized by reads from the selected VDC that relate
to the specified replication group.

Write Bandwidth The bandwidth utilized by writes to the selected VDC that relate to the
specified replication group.

CHAPTER 3
Monitoring Events: Audits and Alerts

l About event monitoring
l Monitor audit data
l Audit messages
l Monitor alerts
l Alert policy
l Acknowledge all alerts
l Alert messages

About event monitoring


You can view the available event monitoring messages (audit and alert) from the ECS Portal.
The Monitor > Events page has two tabs:
l Audit: All activity by users working with the portal, the ECS REST APIs, and the ECS CLI.
Other audit types include upgrade activities.
l Alerts: Alerts raised by the ECS system.
Event data available through the ECS Portal is limited to 30 days. If you need to keep event data
for longer periods, consider using the ViPR SRM product.

Monitor audit data


Use the Monitor > Events > Audit tab to view and manage audit data.
About this task
See the List of audit messages.
Procedure
1. Select the Audit tab.
2. Optionally, select Filter.
3. Specify a Date Time Range and adjust the From and To fields and time fields. When
creating a custom date-time range, select Current Time to use the current date and time as
the end of your range.
4. Select a Namespace.
5. Click Apply.
Note: The newest audit messages appear at the top of the table.

Audit messages
List of the audit messages generated by ECS.

Table 19 ECS audit messages

Service Audit item Audit message

Alert sent_alert Alert \"${alertMessage}\" with symptom code ${symptomCode} triggered

Auth Provider new_authentication_provider_added New authentication provider ${resourceId} added

Auth Provider authentication_provider_deleted Authentication provider ${resourceId} deleted

Auth Provider authentication_provider_updated Existing Authentication provider ${resourceId} updated

Bucket bucket_created Bucket ${resourceId} has been created

Bucket bucket_deleted Bucket ${resourceId} has been deleted

Bucket bucket_updated Bucket ${resourceId} has been updated

Bucket bucket_ACL_set Bucket ${resourceId} ACLs have changed

Bucket bucket_owner_changed Owner of ${resourceId} bucket has changed

Bucket bucket_versioning_set Versioning has been enabled on ${resourceId} bucket

Bucket bucket_versioning_unset Versioning has been suspended on ${resourceId} bucket

Bucket bucket_versioning_source_set Bucket ${resourceId} versioning source set

Bucket bucket_metadata_set Metadata on ${resourceId} bucket has been changed

Bucket bucket_head_metadata_set Bucket ${resourceId} head metadata set

Bucket bucket_expiration_policy_set Bucket ${resourceId} expiration policy has updated

Bucket bucket_expiration_policy_deleted Bucket ${resourceId} expiration policy has been deleted

Bucket bucket_cors_config_set Bucket ${resourceId} CORS rules have been changed

Bucket bucket_cors_config_deleted Bucket ${resourceId} CORS rules have been deleted

Bucket notification_size_exceeded_on_bucket Notification size has been exceeded on ${resourceId} bucket

Bucket block_size_exceeded_on_bucket Block size has been exceeded on ${resourceId} bucket

Bucket bucket_set_quota Bucket ${resourceId} quota has been updated with notification size as ${notificationSize} and block size as ${blockSize}

Bucket bucket_policy_created Bucket ${resourceId} policy has been created

Bucket bucket_policy_updated Bucket ${resourceId} policy has been updated

Bucket bucket_policy_deleted Bucket ${resourceId} policy has been deleted

Cluster cluster_set Cluster id ${resourceId} has been set

Fabric InstallerServiceOperation[kind=INSTALLER_SERVICE_OPERATION, host=${hostName}, timestamp=${timestamp}, operationType=${operation}, args=${arguments of operation}, status=SUCCEEDED, fqdn=${fqdn of host}, version=${installer version}]

Fabric NodeMaintenanceMode[kind=NodeMaintenanceMode, timestamp=${timestamp}, agentId=${agendId}, fqdn=${fqdn}, status=${MaintenanceStatus}]

License user_added_license License ${resourceId} has been added

License managed_capacity_exceeded Managed capacity has exceeded licensed ${resourceId} capacity

License license_expired License ${resourceId} has expired

Local user domain_group_mapping_created Domain group ${resourceId} to ${roles} role(s) mapping is added

Local user domain_group_mapping_created_no_role Domain group ${resourceId} without role mappings is added

Local user domain_group_mapping_updated Domain group ${resourceId} roles mapping is changed to ${roles} role(s)

Local user domain_group_mapping_updated_no_roles All roles of domain group ${resourceId} mapping have been removed

Local user domain_user_mapping_created Domain user ${resourceId} to ${roles} role(s) mapping is added

Local user domain_user_mapping_created_no_roles Domain user ${resourceId} without role mappings is added

Local user domain_user_mapping_deleted Domain user ${resourceId} mapping is removed

Local user domain_user_mapping_updated Domain user ${resourceId} role mapping is changed to ${roles} role(s)

Local user domain_user_mapping_updated_no_roles All roles of domain user ${resourceId} mapping have been removed

Local user local_user_created Management user ${resourceId} with ${roles} role(s) has been created

Local user local_user_created_no_roles Management user ${resourceId} without roles has been created

Local user local_user_deleted Management user ${resourceId} has been deleted

Local user local_user_password_changed Credential of management user ${resourceId} has changed

Local user local_user_updated Roles of management user ${resourceId} have been changed to ${roles}

Local user local_user_roles_updated_no_roles All roles of management user ${resourceId} have been removed

Locked vdc_lock_successful VDC lock was successful

Locked vdc_lock_failed VDC lock failed

Locked node_lock_successful Lock successful for node ${resourceId}

Locked node_lock_failed Lock failed for node ${resourceId}

Locked node_unlock_successful Unlock successful for node ${resourceId}

Locked node_unlock_failed Unlock failed for node ${resourceId}

Login login_successful User ${resourceId} logged in successfully

Login login_failed User ${resourceId} failed to login

Login user_token_logout User logged out token ${resourceId}

Login user_logout All user tokens have logged out

Namespace block_size_exceeded_on_namespace Block size has been exceeded on ${resourceId} namespace

Namespace namespace_admin_group_mappings_updated Namespace ${resourceId} admin group mappings updated to following groups: ${groups}

Namespace namespace_admin_group_mappings_updated_no_groups Namespace ${resourceId} admin groups mappings updated to an empty list

Namespace namespace_admin_user_mappings_updated Namespace ${resourceId} admin mappings updated to following users: ${admins}

Namespace namespace_admin_user_mappings_updated_no_admins Namespace ${resourceId} admin mappings updated to an empty list

Namespace namespace_created Namespace ${resourceId} has been created

Namespace namespace_deleted Namespace ${resourceId} has been deleted

Namespace namespace_updated Namespace ${resourceId} has been updated

Namespace notification_size_exceeded_on_namespace Notification size has been exceeded on ${resourceId} namespace

NFS ugmapping_created ${type} mapping ${ugMappingName} --> ${resourceId} has been created

NFS ugmapping_deleted ${type} mapping ${ugMappingName} --> ${resourceId} has been deleted

NFS export_created Export with export path ${exportPath} has been created

NFS export_deleted Export with export path ${exportPath} has been deleted

NFS export_updated Export with export path ${exportPath} has been updated

Replication Group replication_group_created Replication Group ${resourceId} has been created

Replication Group replication_group_updated Replication Group ${resourceId} has been updated

Security command_exec_insufficient_permission Attempt to execute a command ${command} from ${host} without right permissions

SNMP snmp_v2_target_created SNMP target ${snmpTarget} with Community '${community}' is added

SNMP snmp_v3_target_created SNMP target ${snmpTarget} with Username '${username}', Authentication(${authProtocol}) and Privacy(${privProtocol})

SNMP snmp_target_deleted SNMP target ${snmpTarget} is deleted

SNMP snmp_engineid_updated SNMP agent EngineID is set to ${engineId}

SNMP snmp_v2_target_updated SNMP target ${oldSnmpTarget} is updated as ${newSnmpTarget} with Community string ${community}

SNMP snmp_v3_target_updated SNMP target ${oldSnmpTarget} is updated as ${newSnmpTarget} with Username ${username}, Authentication(${authProtocol}) and Privacy(${privProtocol})

Storage Pool storage_pool_created Storage Pool ${resourceId} has been created

Storage Pool storage_pool_deleted Storage Pool ${resourceId} has been deleted

Storage Pool storage_pool_updated Storage Pool ${resourceId} has been updated

Syslog syslog_server_added Syslog server ${protocol}://${host}:${port} with severity ${severity} is added into the configuration

Syslog syslog_server_updated Syslog server ${old_protocol}://${old_host}:${old_port} is updated to ${protocol}://${host}:${port} with severity ${severity} in the configuration

Syslog syslog_server_deleted Syslog server ${protocol}://${host}:${port} is removed from the configuration

Transformation transformation_created_message Transformation created

Transformation transformation_updated_message Transformation updated

Transformation transformation_pre_check_started_message Transformation precheck started

Transformation transformation_enumeration_started_message Transformation enumeration started

Transformation transformation_indexing_started_message Transformation indexing started

Transformation transformation_migration_started_message Transformation migration started

Transformation transformation_recovery_migration_started_message Transformation recovery migration started

Transformation transformation_reconciliation_started_message Transformation reconciliation started

Transformation transformation_sources_updated_message Transformation sources updated

Transformation transformation_deleted_message Transformation deleted

Transformation transformation_retried_message Transformation %s retried

Transformation transformation_canceled_message Transformation %s canceled

Transformation transformation_profile_mappings_updated_message Transformation profile mappings updated

User change_password_failed User ${resourceId} failed to change password, reason: ${reason}

User user_created Object user ${resourceId} has been created

User user_deleted Object user ${resourceId} has been deleted

User user_set_password New password has been set for object user ${resourceId}

User user_delete_password Password has been deleted for object user ${resourceId}

User user_set_metadata New metadata has been set for object user ${resourceId}

User user_locked Object user ${resourceId} has been locked

User user_unlocked Object user ${resourceId} has been unlocked

User user_set_user_tag User Tag has been set for object user ${resourceId}

User user_delete_user_tag User Tag has been deleted for object user ${resourceId}
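
The audit messages in Table 19 are templates: placeholders such as ${resourceId} are replaced with actual values when an event is recorded. The following Python sketch only illustrates that substitution and is not ECS code. It uses the standard-library string.Template class, whose ${name} syntax happens to match the placeholders shown in the table. The render_audit_message helper and the sample values are hypothetical.

```python
from string import Template

# Two message templates copied from Table 19.
AUDIT_TEMPLATES = {
    "bucket_created": "Bucket ${resourceId} has been created",
    "bucket_set_quota": (
        "Bucket ${resourceId} quota has been updated with notification size "
        "as ${notificationSize} and block size as ${blockSize}"
    ),
}


def render_audit_message(audit_item, **values):
    """Fill the placeholders of one audit template (hypothetical helper)."""
    # substitute() raises KeyError if a placeholder has no matching value.
    return Template(AUDIT_TEMPLATES[audit_item]).substitute(**values)


print(render_audit_message("bucket_created", resourceId="bucket1"))
# prints: Bucket bucket1 has been created
```

Reading a message this way can help when searching audit data: the fixed words of the template stay constant, and only the placeholder values vary between events.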

Monitor alerts
You can use the Monitor > Events > Alerts tab to view and manage system alerts.
About this task
See the list of alert messages.
Alert message Severity labels have the following meanings:
l Critical: Messages about conditions that require immediate attention
l Error: Messages about error conditions that report either a physical failure or a software failure
l Warning: Messages about less than optimal conditions
l Info: Routine status messages
Procedure
1. Select Alerts.
2. Optionally, click Filter.
3. Select your filters. The alerts filter adds filtering by Severity and Type, and an option to
Show Acknowledged Alerts, which retains the display of an alert even after it is
acknowledged by the user. When creating a custom date-time range, select Current Time
to use the current date and time as the end of your range.
Alert types must be entered exactly as described in the following table:

Table 20 Alert types

Alert Type (type exactly as shown) Description

Fabric Raised when system issues are detected.

Geo Raised for geo-replication alerts.

License Raised for license, capacity, or capacity entitlement exceeded alerts.

Notify Raised for miscellaneous alerts.

Quota Raised when soft or hard quota limits are exceeded (SoftQuotaLimitExceeded or
HardQuotaLimitExceeded) for a bucket or for a namespace.

RPO Raised when the recovery point objective (RPO) is greater than the RPO threshold.

Capacity Alerting Raised when the remaining capacity of the storage pool reaches a set threshold.

Capacity License Threshold Raised if the system capacity is greater than the licensed capacity.

CHUNK_NOT_FOUND Raised when chunk data is not found.

DTSTATUS_RECENT_FAILURE Raised when the status of a data table is bad.

Table 21 ESRS dial home types

Alert Type (type exactly as shown) Description

TestDialHome Raised to test that ESRS connections can be established and that the call home
functionality works.

4. Select a Namespace.
5. Click Apply.
6. Next to each event, click the Acknowledge Alert button to acknowledge and dismiss the
message. Messages that have previously been acknowledged will display when the Show
Acknowledged Alerts filter option is selected, but the Acknowledge Alert button will not
be displayed for these rows.
7. You can click the Description of an alert, when it is formatted as a link, to be taken to a
relevant page in the portal.
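
Because alert types must be entered exactly as listed in Table 20 and Table 21, a script that prepares filter values could check them before use. The following Python sketch shows one way to do that. The is_valid_alert_type helper is hypothetical, and treating the match as case-sensitive is an assumption based on the "type exactly as shown" wording.

```python
# Alert types copied from Table 20 and Table 21.
ALERT_TYPES = frozenset({
    "Fabric",
    "Geo",
    "License",
    "Notify",
    "Quota",
    "RPO",
    "Capacity Alerting",
    "Capacity License Threshold",
    "CHUNK_NOT_FOUND",
    "DTSTATUS_RECENT_FAILURE",
    "TestDialHome",
})


def is_valid_alert_type(text):
    """Return True only for an exact match (assumed case-sensitive)."""
    return text in ALERT_TYPES


print(is_valid_alert_type("Quota"))  # True
print(is_valid_alert_type("quota"))  # False
```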

Alert policy
Alert policies are created to alert about metrics, and are triggered when the specified conditions
are met. Alert policies are created per VDC.
You can use the Settings > Alerts Policy page to view alert policies.
There are two types of alert policy:
System alert policies
l System alert policies are precreated and exist in ECS during deployment.
l All the metrics have an associated system alert policy.
l System alert policies cannot be updated or deleted.
l System alert policies can be enabled or disabled.
l Alerts are sent to the UI and all channels (SNMP, SYSLOG, and Secure Remote Services).

User-defined alert policies


l You can create user-defined alert policies for the required metrics.
l Alerts are sent to the UI and customer channels (SNMP and SYSLOG).

New alert policy


You can use the Settings > Alerts Policy > New Alert Policy tab to create user-defined alert
policies.
Procedure
1. Select New Alert Policy.
2. Give a unique policy name.
3. Use the metric type drop-down menu to select a metric type.
Metric Type is a grouping of statistics. It consists of:
l Btree Statistics
l CAS GC Statistics
l Geo Replication Statistics
l Metering Statistics
l Garbage Collection Statistics
l EKM

4. Use the metric name drop-down menu to select a metric name.


5. Select level.
a. To inspect metrics at the node level, select Node.
b. To inspect metrics at the VDC level, select VDC.
6. Select polling interval.
Polling Interval determines how frequently data is checked. Each polling interval
gives one data point, which is compared against the specified condition. When the
condition is met, an alert is triggered.
7. Select instances.
Instances describe how many data points to check and how many should match the
specified conditions to trigger an alert.
For metrics where historical data is not available, only the latest data is used.

8. Select conditions.
You can set the threshold values and alert type with Conditions.
An alert can be a Warning Alert, an Error Alert, or a Critical Alert.

9. To add more conditions with multiple thresholds and with different alert levels, select Add
Condition.
10. Click Save.
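
The way polling interval, instances, and conditions combine can be pictured with a small Python sketch. This is an illustration under stated assumptions, not the ECS implementation: the Condition class and evaluate_policy function are hypothetical, a condition is assumed to match when a data point is greater than its threshold, and the highest matching severity is assumed to win.

```python
from dataclasses import dataclass

SEVERITY_ORDER = ["Warning", "Error", "Critical"]


@dataclass
class Condition:
    severity: str     # "Warning", "Error", or "Critical"
    threshold: float  # assumed rule: a data point matches when it exceeds this


def evaluate_policy(data_points, conditions, instances_to_check, instances_to_match):
    """Return the highest severity whose condition is met, or None.

    data_points holds one value per polling interval, oldest first.
    """
    recent = data_points[-instances_to_check:]
    triggered = None
    # Walk the conditions from lowest to highest severity so the last hit wins.
    for cond in sorted(conditions, key=lambda c: SEVERITY_ORDER.index(c.severity)):
        matches = sum(1 for value in recent if value > cond.threshold)
        if matches >= instances_to_match:
            triggered = cond.severity
    return triggered


conditions = [
    Condition("Warning", 50),
    Condition("Error", 80),
    Condition("Critical", 90),
]
samples = [10, 60, 70, 95, 96]
print(evaluate_policy(samples, conditions, 5, 3))  # Warning
print(evaluate_policy(samples, conditions, 5, 2))  # Critical
```

In the first call, four of the five data points exceed 50 but only two exceed 80, so requiring three matches yields a Warning. Lowering the required matches to two lets the Critical condition trigger.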

Acknowledge all alerts


Alerts can be acknowledged individually or in bulk using the Acknowledge All Alerts button. You
can acknowledge all the alerts or use filters to acknowledge a subset of them.
About this task
You can use the Monitor > Events > Alerts tab to acknowledge alerts.
Procedure
1. To acknowledge all alerts, click the Acknowledge All Alerts button.
a. To acknowledge a subset of all alerts, use the table filter to filter by a combination of
date and time, severity, type, or namespace, and then click Acknowledge All Alerts.
The bulk alert acknowledgment process runs in the background and may take a few minutes
to complete. Only one bulk alert acknowledgment can be processed at a time.
2. On the confirmation pop-up screen, click OK to initiate acknowledgment, or click Cancel to
exit without acknowledging.
Clicking Acknowledge All Alerts initiates a background task to acknowledge all the
matching alerts. The response indicates whether the task was initiated successfully or failed.

To keep a record of the acknowledge all alerts request, a new informational alert of type
Bulk Alert Ack is generated after the acknowledgment completes. Clear the filter and
manually refresh the table to see it.
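
The effect of acknowledging a filtered subset can be sketched with an in-memory model. The following Python example is illustrative only: the Alert class and acknowledge_matching function are hypothetical and do not call any ECS API, and the date and time filter is left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    severity: str
    alert_type: str
    namespace: str
    acknowledged: bool = False


def acknowledge_matching(alerts, severity=None, alert_type=None, namespace=None):
    """Acknowledge each unacknowledged alert that passes every given filter.

    A filter left as None is not applied. Returns the number acknowledged.
    """
    count = 0
    for alert in alerts:
        if alert.acknowledged:
            continue
        if severity is not None and alert.severity != severity:
            continue
        if alert_type is not None and alert.alert_type != alert_type:
            continue
        if namespace is not None and alert.namespace != namespace:
            continue
        alert.acknowledged = True
        count += 1
    return count


alerts = [
    Alert("Warning", "Quota", "ns1"),
    Alert("Error", "Quota", "ns1"),
    Alert("Warning", "RPO", "ns2"),
]
print(acknowledge_matching(alerts, severity="Warning"))  # 2
print(acknowledge_matching(alerts))                      # 1
```

The second call has no filters, so it acknowledges whatever is still unacknowledged, which mirrors clicking Acknowledge All Alerts with the filter cleared.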

Alert messages
List of the alert messages that ECS uses.
Alert message Severity labels have the following meanings:
l Critical: Messages about conditions that require immediate attention
l Error: Messages about error conditions that report either a physical failure or a software failure
l Warning: Messages about less than optimal conditions
l Info: Routine status messages

Table 22 ECS Object alert messages

Alert Severity Symptom code Sent to... Message Description Action

Alert: Btree chunk level GC
Severity and symptom code: Warning 1321
Sent to: Portal, API, Secure Remote Services, SNMP Trap, Syslog
Message: System metadata garbage reclamation throughput is too slow to catch up with garbage detection.
Description: Event trigger source
l Example: Reclaimed Btree garbage is less than 10% of the remaining Btree garbage because Btree GC is slow at chunk reclamation.
l This condition has persisted for the last 7 days, leading to creation of this alert.
l Derived from formula: Full_Garbage > 1TB, and Garbage_Detected_Rate - Garbage_Chunk_Reclaim_Rate > 100GB
Action: Contact ECS Remote Support.

Alert: Btree disk level GC
Severity and symptom code: Warning 1325
Sent to: Portal, API, Secure Remote Services, SNMP Trap, Syslog
Message: Capacity free-up throughput is too slow to catch up with system metadata garbage reclamation.
Description: Event trigger source
l Example: Reclaimed Btree garbage is less than 10% of the full garbage because Btree GC is slow at disk level reclamation.
l This condition has persisted for the last 7 days, leading to creation of this alert.
l Derived from formula: Garbage_Pending_Delete > 1TB, and Garbage_Chunk_Reclaim_Rate - Garbage_Capacity_Reclaim_Rate > 100GB
Action: Contact ECS Remote Support.

Alert: Btree partial GC
Severity and symptom code: Warning 1329
Sent to: Portal, API, Secure Remote Services, SNMP Trap, Syslog
Message: Partial GC for system metadata is too slow.
Description: Event trigger source
l Example: The rate of Btree partial GC conversion to full garbage is less than 10% of the partial garbage eligible for conversion.
l Btree partial GC works too slowly to convert partial garbage into full garbage.
l This condition has persisted for the last 7 days, leading to creation of this alert.
l Derived from formula: Partial_Eligible_Garbage > 1TB, and Partial_To_Full_Convert_Rate < 100GB
Action: Contact ECS Remote Support.

Alert: Bucket hard quota
Severity and symptom code: Error 1006
Sent to: Portal, API, SNMP Trap, Syslog
Message: HardQuotaLimitExceeded: bucket {bucket_name}

Alert: Bucket soft quota
Severity and symptom code: Warning 1008
Sent to: Portal, API, SNMP Trap, Syslog
Message: SoftQuotaLimitExceeded: bucket {bucket_name}

Alert: Capacity alerting
Severity and symptom code: Warning 1111, Error 1112, Critical 1113
Sent to: Portal, API, SNMP Trap, Syslog
Message: Storage pool {Storage pool} has {id}% remaining capacity meeting threshold of {id}%.
Description: The severity of the alert depends on how close the remaining storage pool capacity is to reaching the configured threshold. Capacity alerting is not set by default: set capacity alerts to receive them. You can set them by editing an existing storage pool or when you create a storage pool.

Alert: Capacity exceeded threshold
Severity and symptom code: Warning 1100
Sent to: Portal, API, Secure Remote Services, SNMP Trap, Syslog
Message: Used Capacity of the VDC exceeded configured threshold, current usage is {usage}%.
Description: The configured threshold is set at 80% of the Used Capacity of the VDC by default.
CAUTION If the used capacity reaches 90%, you cannot write or modify object data.
Action: Contact ECS Remote Support representative to determine the appropriate solution.

Alert: Capacity license threshold
Severity and symptom code: Error 997
Sent to: Portal, API, Secure Remote Services, Trap, Syslog
Message: Licensed Capacity Entitlement Exceeded Event
Description: The capacity of the system is greater than was licensed.

Alert: Chunk not found
Severity and symptom code: Error 1004
Sent to: Portal, API, Secure Remote Services, SNMP Trap, Syslog
Message: chunkId {chunkId} not found

Alert: CPU Usage Percent
Severity and symptom code: Warning 4001, Error 4002, Critical 4003
Sent to: Portal, API, SNMP Trap, Syslog
Message: CPU usage is ${inspectorValue}% crosses threshold ${thresholdValue}%
Description: If CPU usage percent crosses the threshold specified then the alert is triggered.

Alert: Data Migration Blocked
Severity and symptom code: Error 1500
Sent to: Portal, ESRS, SNMP Trap, Syslog, SMTP
Message: Data Migration has no movement for ${configured} hours for a device and level (default 6 hours).
Description: Data migration has no progress for several hours.

Note: For the Data Migration Finished alert, ignore the Warning severity. The severity is supposed to be Info.

Alert: Data Migration Finished
Severity and symptom code: Warning 1501
Sent to: Portal, ESRS, SNMP Trap, Syslog, SMTP
Message: Data Migration is complete for a device and level.
Description: Data migration is complete.

Alert: Disabled CAS GC
Severity and symptom code: Info 1316, Warning 1317, Error 1318, Critical 1319
Sent to: Portal, API, Secure Remote Services, SNMP Trap, Syslog
Message: CAS Processing is paused.
Description:
l CAS GC is Content Addressable Storage Garbage Collection.
l CAS GC is disabled.
Action: Contact ECS Remote Support representative to determine the appropriate solution.

Alert: Disk unmounted
Severity and symptom code: Info 2031
Sent to: Portal, API, SNMP Trap, Syslog
Message: Entering maintenance mode on node {fqdn}
Description: Disk was unmounted

Alert: DT init failure
Severity and symptom code: Error 3001
Sent to: Portal, API, Secure Remote Services, SNMP Trap, Syslog
Message: There are more than {numbers} DTs failed or DT stats check failed in last {number} rounds of DT status check.
Description:
l DT is a directory table.
l The default value is set at 8 DTs for this alert to trigger.

Alert: EKM Server Certificate Expiry
Severity and symptom code: Warning 1361, Error 1362
Sent to: Portal, API, Secure Remote Services, SNMP Trap, Syslog
Message:
l The server certificate for EKM server expires in 30 days. Renew the certificate.
l The server certificate for EKM server expires in 7 days. Renew the certificate.

Alert: EKM Server Connection Status
Severity and symptom code: Warning 1369, Error 1370
Sent to: Portal, API, Secure Remote Services, SNMP Trap, Syslog
Message: The EKM server is not responding. Ensure that the server is connected.

Alert: First Byte Latency For Read
Severity and symptom code: Warning 4009, Error 4010, 4011
Sent to: Portal, API, SNMP Trap, Syslog
Message: First Byte Latency for Read is ${inspectorValue}ms crosses threshold ${thresholdValue}ms
Description: If TTFB for read latency crosses the threshold specified then the alert is triggered.

Alert: Last Byte Latency For Write
Severity and symptom code: Warning 4003, Error 4014, Critical 4015
Sent to: Portal, API, SNMP Trap, Syslog
Message: Last Byte Latency for Write is ${inspectorValue}ms crosses threshold ${thresholdValue}ms
Description: If TTLB for write latency crosses the threshold specified then the alert is triggered.

Alert: License expiration
Severity and symptom code: Info 998
Sent to: Portal, API, Secure Remote Services, SNMP Trap, Syslog
Message: Expiration event

Alert: License registration
Severity and symptom code: Info 100
Sent to: Portal, API, Secure Remote Services, SNMP Trap, Syslog
Message: Registration Event

Alert: Memory outside Btree writes cache
Severity and symptom code: Warning 1349
Sent to: Portal, API, Secure Remote Services, SNMP Trap, Syslog
Message: For cm process memory of X bytes is allocated outside Btree write cache on node <Node IP>.

Alert: Metering read latency
Severity and symptom code: Warning 1205, Error 1206, Critical 1207
Sent to: Portal, API, Secure Remote Services, SNMP Trap, Syslog
Message:
Read latency is 300 millisecond, crosses threshold 250 millisecond.
Read latency is 505 millisecond, crosses threshold 500 millisecond.
Read latency is 1050 millisecond, crosses threshold 1000 millisecond.
Action: Contact ECS Remote Support.

Alert: Metering write latency
Severity and symptom code: Warning 1205, Error 1206, Critical 1207
Sent to: Portal, API, Secure Remote Services, SNMP Trap, Syslog
Message:
Write latency is 300 millisecond, crosses threshold 250 millisecond.
Write latency is 555 millisecond, crosses threshold 500 millisecond.
Write latency is 1500 millisecond, crosses threshold 1000 millisecond.
Action: Contact ECS Remote Support.

Alert: Monitoring Health
Severity and symptom code: Critical 4016, 4017, 4018
Sent to: Portal, API, Secure Remote Services, SNMP Trap, Syslog
Message: Data recorded in TSDB is lagging by {thresholdValue} mins on node x.x.x.x

Alert: Namespace hard quota
Severity and symptom code: Error 1005
Sent to: Portal, API, SNMP Trap, Syslog
Message: HardQuotaLimitExceeded: Namespace {namespace}

Alert: Namespace soft quota
Severity and symptom code: Warning 1009
Sent to: Portal, API, SNMP Trap, Syslog
Message: SoftQuotaLimitExceeded: Namespace {namespace}

Alert: Node maintenance
Severity and symptom code: Critical 2037
Message: Disk {diskSerialnumber} on node {fqdn} has unmounted.
Description: Node maintenance mode

Alert: Notification
Severity and symptom code: Any, Any
Message: User-defined message.
Description: Custom message that is defined and provided by the user.

Alert: Process memory table free space percent
Severity and symptom code: Error 1354
Sent to: Portal, API, Secure Remote Services, SNMP Trap, Syslog
Message: Memory table size for blob process is X % less than the specified threshold of Y % on <node IP>.
Action: Contact ECS Remote Support.

Alert: Repo chunk level GC
Severity and symptom code: Warning 1333
Sent to: Portal, API, Secure Remote Services, SNMP Trap, Syslog
Message: User garbage collection throughput is too slow to catch up with garbage detection.
Description: Event trigger source
l Example: Repo chunk reclamation rate is less than 10% of the remaining garbage.
l This condition has persisted for the last 7 days, leading to creation of this alert.
l Derived from formula: Full_Garbage > 10TB, and Garbage_Detected_Rate - Garbage_Chunk_Reclaim_Rate > 100GB
Action: Contact ECS Remote Support.

Alert: Repo disk level GC
Severity and symptom code: Warning 1337
Sent to: Portal, API, Secure Remote Services, SNMP Trap, Syslog
Message: Capacity free-up throughput is too slow to catch up with user garbage collection.
Description: Event trigger source
l Example: Repo disk level GC reclamation rate is less than 10% of the garbage pending delete at disk level.
l This condition has persisted for the last 7 days, leading to creation of this alert.
l Derived from formula: Garbage_Pending_Delete > 10TB, and Garbage_Chunk_Reclaim_Rate - Garbage_Capacity_Reclaim_Rate > 100GB
Action: Contact ECS Remote Support.

Alert: Repo partial GC
Severity and symptom code: Warning 1341
Sent to: Portal, API, Secure Remote Services, SNMP Trap, Syslog
Message: Partial GC for user garbage is too slow.
Description: Event trigger source
l Example: Repo partial GC works too slowly to convert partial garbage into full garbage.
l This condition has persisted for the last 7 days, leading to creation of this alert.
l Derived from formula: Partial_Eligible_Garbage > 10TB, and Partial_To_Full_Convert_Rate < 100GB
Action: Contact ECS Remote Support.

Alert: RPO
Severity and symptom code: Warning 1012
Sent to: Portal, API, Secure Remote Services, Trap, Syslog
Message: RPO for replication group {RG} is {HH} hour {SS} seconds greater than {HH} hour threshold set.
Description: The recovery point objective (RPO) is greater than the RPO threshold. The default value is one hour.

Alert: Slow CAS GC Object Cleanup
Severity (symptom code): Info (1312), Warning (1313), Error (1314), Critical (1315)
Sent to: Portal, API, Secure Remote Services, SNMP Trap, Syslog
Message: CAS Processing object cleanup speed is slow.
Description: CAS GC cleanup tasks are lagging.

Alert: Slow CAS GC Reference Collection
Severity (symptom code): Info (1308), Warning (1309), Error (1310), Critical (1311)
Sent to: Portal, API, Secure Remote Services, SNMP Trap, Syslog
Message: CAS Processing reference collection speed is slow.
Description: CAS GC reference collection tasks are lagging.

Alert: Slow Journal Parsing
Severity (symptom code): Info (1304), Warning (1305), Error (1306), Critical (1307)
Sent to: Portal, API, Secure Remote Services, SNMP Trap, Syslog
Message: Journal parsing speed is slow.
Description: Journal parsing speed is slow.

Alert: Space Usage Percent
Severity (symptom code): Warning (4005), Error (4006), Critical (4007)
Sent to: Portal, API, SNMP Trap, Syslog
Message: Disk space usage is ${inspectorValue}% crosses threshold ${thresholdValue}%
Description: The alert is triggered when the disk usage percentage crosses the specified threshold.

Alert: SSD Read Cache Capacity Failure
Severity: Error
Symptom code: 1392
Sent to: Portal, API, Secure Remote Services, SNMP Trap, Syslog
Message: SSD read cache auto clean up failed when capacity full and fall back to memory cache.
Description: The SSD read cache fell back to the memory cache because cleanup failed when the cache was at full capacity.

Alert: GC Status
Severity: Warning
Symptom code: 1345
Sent to: Portal, API, Secure Remote Services, SNMP Trap, Syslog
Message: Space reclamation for user data/system metadata is disabled. Make sure it is disabled for temporary purpose, and re-enable it when ready.
Action: Contact ECS Remote Support.
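
The Repo partial GC condition in Table 22 (symptom code 1341) can be sketched as a small check. This is only an illustration of the published formula, not the ECS implementation: the function names are hypothetical, the guide does not state the time unit of the conversion rate or whether TB and GB are decimal or binary, and decimal units and one check per day are assumed here.

```python
# Assumption: decimal units. The guide does not say whether TB/GB are decimal or binary.
TB = 10**12
GB = 10**9


def repo_partial_gc_condition(partial_eligible_garbage, partial_to_full_convert_rate):
    """Return True when the Table 22 formula for symptom code 1341 holds.

    Both arguments are byte counts; the time unit of the rate is not stated in the guide.
    """
    return (partial_eligible_garbage > 10 * TB
            and partial_to_full_convert_rate < 100 * GB)


def persisted(daily_results, days=7):
    """Return True when the condition held on each of the last `days` checks.

    Assumption: one check per day; the guide only says the condition persisted for 7 days.
    """
    return len(daily_results) >= days and all(daily_results[-days:])


if __name__ == "__main__":
    week = [repo_partial_gc_condition(12 * TB, 50 * GB)] * 7
    print("raise alert:", persisted(week))
```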


Table 23 ECS fabric alert messages

Columns: Alert, Severity, Symptom code, Sent to, Message, Description, Action

Alert: Disk Ready for Replacement
Severity: Info
Symptom code: 2061
Sent to: Portal, API, SNMP Trap, Syslog, Secure Remote Services
Message: Node SN={node sn} Disk SN=${disk sn} in rack={rack}, node={fqdn}, slot={slot number} is ready for replacement. Disk Details: Type={disk type}, Model={vendor model}, Size={disk size} GB, Firmware=${firmware version}.
Description: A disk with SUSPECT/BAD health is no longer used by the object service, is unmounted, and is ready to be replaced.

Alert: Disk Failed Replace Process
Severity: Error
Symptom code: 2062
Sent to: Portal, API, SNMP Trap, Syslog, Secure Remote Services
Message: Node SN={node sn} Disk SN={diskSerialNumber} in rack={rack}, node={fqdn}, slot={slot} cannot be removed. Disk Details: Type={disk type}, Model={Vendor Model}, Size={size} GB, Firmware={firmware}, reason: {reason}
Description: The disk health changed to SUSPECT/BAD and Fabric started the process to remove the disk from use, but something went wrong.

Alert: Disk Missing
Severity: Error
Symptom code: 2063
Sent to: Portal, API, SNMP Trap, Syslog, Secure Remote Services
Message: Disk SN={diskSerialNumber} in rack={rack}, node={fqdn}, slot={slot} is missing. Disk Details: Type={HDD/SSD}, Model={Vendor Model}, Size={size} GB, Firmware={firmware}, reason: {reason}
Description: Fabric could not detect an assigned disk or its partition.


Alert: Disk added
Severity: Info
Symptom code: 2019
Sent to: Portal, API, SNMP Trap, Syslog
Message: Disk {diskSerialNumber} on node {fqdn} was added.
Description: Disk was added.

Alert: Disk failure
Severity: Critical
Symptom code: 2002
Sent to: Portal, API, SNMP Trap, Syslog, Secure Remote Services
Message: Disk SN={diskSerialNumber} on rack={rack}, node={fqdn}, slot={slot number} has FAILED. Disk Details: Type={disk type}, Model='{VID PID}', Size='{disk size} GB', Firmware={firmware version}
Description: The health of the disk changed to BAD.

Alert: Disk good
Severity: Info
Symptom code: 2025
Sent to: Portal, API, SNMP Trap, Syslog
Message: Disk {diskSerialNumber} on node {fqdn} was revived.
Description: Disk was revived.

Alert: Disk mounted
Severity: Info
Symptom code: 2035
Sent to: Portal, API, SNMP Trap, Syslog
Message: Disk {diskSerialNumber} on node {fqdn} has mounted.
Description: Disk was mounted.

Alert: Disk removed
Severity: Info
Symptom code: 2020
Sent to: Portal, API, SNMP Trap, Syslog
Message: Disk {diskSerialNumber} on node {fqdn} was removed.
Description: Disk was removed.

Alert: Disk suspect
Severity: Error
Symptom code: 2003
Sent to: Portal, API, SNMP Trap, Syslog, Secure Remote Services
Message: Disk SN={diskSerialNumber} on rack={rack}, node={fqdn}, slot={slot number} has SUSPECTED. Disk Details: Type={disk type}, Model='{VID PID}', Size='{disk size} GB', Firmware={firmware version}
Description: The health of the disk changed to SUSPECT.

Alert: Disk unmounted
Severity: Warning
Symptom code: 2036
Sent to: Portal, API, SNMP Trap, Syslog
Message: Disk {diskSerialNumber} on node {fqdn} has unmounted.
Description: Disk was unmounted.

Alert: Docker container configuration failure
Severity: Critical
Symptom code: 2022
Sent to: Portal, API, SNMP Trap, Syslog, Secure Remote Services
Message: Container {containerName} configuration has failed on node {fqdn} with exit code {exitCode} {happenedOn}.
Description: The configure script returned a nonzero exit code. The configure script is provided by object and called by fabric on object container start-up. It applies only to the object container.

Alert: Docker container paused
Severity: Warning
Symptom code: 2017
Sent to: Portal, API, SNMP Trap, Syslog
Message: Container {containerName} has paused on node {fqdn}.
Description: Container paused.

Alert: Docker container running
Severity: Info
Symptom code: 2016
Sent to: Portal, API, SNMP Trap, Syslog
Message: Container {containerName} is up on node {fqdn}.
Description: Container moved to running state.

Alert: Docker container stopped
Severity: Error
Symptom code: 2015
Sent to: Portal, API, SNMP Trap, Syslog
Message: Container {containerName} has stopped on node {fqdn}.
Description: Container stopped.

Alert: Events cannot be delivered.
Severity: Error
Symptom code: 2038
Sent to: Portal, API, Secure Remote Services, SNMP Trap, Syslog
Message: Events cannot be delivered through {SMTP|ESRS} and lost.
Action: Verify the configuration of the channel that the alert refers to.

Alert: Firewall health is BAD or SUSPECT
Severity (symptom code): Bad (2051), Suspect (2052)
Sent to: Portal, API, Secure Remote Services, SNMP Trap, Syslog
Message (2051): Firewall health is BAD! {reason}
Description (2051): Rules or ip sets do not exist, system firewall is off, ip tables or ip set utils do not exist.
Message (2052): Firewall health is SUSPECT! {reason}
Description (2052): Rules or ip sets do not exist, trying to recover.

Alert: Fabric agent suspect
Severity: Error
Symptom code: 2014
Sent to: Portal, API, SNMP Trap, Syslog
Message: FabricAgent has suspected on node {fqdn}.
Description: Fabric agent health is suspect.

Alert: Net interface health down
Severity: Critical
Symptom code: 2023
Sent to: Portal, API, SNMP Trap, Syslog, Secure Remote Services
Message: Net interface {$netInterfaceName}[ on node $FQDN] is down[ with IP address $IP].
Description: Fabric's net interface is down.

Alert: Net interface health up
Severity: Info
Symptom code: 2024
Sent to: Portal, API, SNMP Trap, Syslog, Secure Remote Services
Message: Net interface {$netInterfaceName}[ on node $FQDN] is up[ with IP address $IP].
Description: Fabric's net interface is up.

Alert: Net interface permanent down
Severity: Critical
Symptom code: 2026
Sent to: Portal, API, Secure Remote Services
Message: Net interface {$netInterfaceName}[ on node $FQDN] is permanently down[ with IP address $IP].
Description: The net interface has been down for at least 10 minutes.

Alert: Net interface IP address updated
Severity: Info
Symptom code: 2027
Sent to: Portal, API, SNMP Trap, Syslog
Message: Net interface's {netInterfaceName} IP address on node {fqdn} was changed to {newIpAddress}.
Description: Fabric's net interface IP address changed.

Alert: Node failure
Severity: Critical
Symptom code: 2006
Sent to: Portal, API, SNMP Trap, Syslog, Secure Remote Services
Message: Node {fqdn} has failed.
Description: The node has not been reachable for 30 minutes.

Alert: Node up
Severity: Info
Symptom code: 2018
Sent to: Portal, API, SNMP Trap, Syslog
Message: Node {fqdn} is up.
Description: The node moved to the 'up' state after it was down for at least 15 minutes.

Alert: Root file system filling on node
Severity (symptom code): Warning (2039), Critical (2042)
Sent to: Portal, API, SNMP Trap, Syslog, Secure Remote Services
Message: Thresholds exceeded, usable space on root fs <BYTES> are less than threshold for <LEVEL> level on node <NODE>
Description:
• Free space between 15G and 10G triggers a Warning alert.
• Less than 10G of free space results in a Critical alert.

Alert: Slot permanent down
Severity: Critical
Symptom code: 2021
Sent to: Portal, API, SNMP Trap, Syslog, Secure Remote Services
Message: Container {containerName} is permanently down on node {fqdn}.
Description: The container has been stopped or paused, or has not started at all, for at least 10 minutes.


Alert: Service failure
Severity: Critical
Symptom code: 2011
Sent to: Portal, API, Syslog, Secure Remote Services
Message: Service Health Failure Event
Description: Service failed.
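
The root file system thresholds listed in Table 23 (symptom codes 2039 and 2042) can be expressed as a short classification function. This is a hedged sketch rather than the Fabric logic: the function name is hypothetical, "G" is assumed to mean GiB, and the guide does not say which side of each boundary is inclusive.

```python
# Assumption: "G" in Table 23 means GiB; the guide does not define the unit.
GIB = 1024**3


def root_fs_severity(free_bytes):
    """Map free space on the root file system to an alert severity.

    Returns "Critical" (symptom code 2042), "Warning" (symptom code 2039),
    or None when no threshold is crossed. Boundary handling is an assumption.
    """
    if free_bytes < 10 * GIB:
        return "Critical"
    if free_bytes < 15 * GIB:
        return "Warning"
    return None


if __name__ == "__main__":
    for free in (20 * GIB, 12 * GIB, 9 * GIB):
        print(free // GIB, "GiB free ->", root_fs_severity(free))
```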

Table 24 Secure Remote Services alert messages

Columns: Alert, Severity, Symptom code, Sent to, Description

Alert: TestDialHome
Severity: N/A
Symptom code: TestDialHome
Sent to: Secure Remote Services
Description: Tests that Secure Remote Services connections can be established and that the call home functionality works.
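
The fabric alert messages in Table 23 follow fixed templates, so a receiver (for example, a Syslog consumer) could extract fields from them with a regular expression. The sketch below assumes the rendered Disk failure message (symptom code 2002) matches the template text exactly; the sample message, the function name, and the group names are hypothetical, and real messages may differ in spacing or quoting.

```python
import re

# Pattern derived from the Disk failure template in Table 23 (symptom code 2002).
# Assumption: the rendered message keeps the template's punctuation and spacing.
DISK_FAILED = re.compile(
    r"Disk SN=(?P<serial>\S+) on rack=(?P<rack>[^,]+), "
    r"node=(?P<node>[^,]+), slot=(?P<slot>\S+) has FAILED\."
)


def parse_disk_failed(message):
    """Return the serial number, rack, node, and slot as a dict, or None if no match."""
    match = DISK_FAILED.search(message)
    return match.groupdict() if match else None


# Hypothetical sample message built from the template; not captured from a real system.
SAMPLE = ("Disk SN=ABC123 on rack=red, node=node1.example.com, "
          "slot=7 has FAILED. Disk Details: Type=HDD")

if __name__ == "__main__":
    print(parse_disk_failed(SAMPLE))
```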



CHAPTER 4
Advanced Monitoring

• Advanced Monitoring
• Flux API
• Dashboard APIs


Advanced Monitoring
Advanced Monitoring dashboards provide critical information about the ECS processes on the VDC
you are logged in to. The advanced monitoring dashboards are based on a time series database and
are provided by Grafana, a well-known open-source time series analytics platform.
Refer to the Grafana documentation for basic details of navigating Grafana dashboards.

View Advanced Monitoring Dashboards


To view the advanced monitoring dashboards in the ECS Portal, select Advanced Monitoring.
The Data Access Performance - Overview dashboard is displayed by default.

Table 25 Advanced monitoring dashboards

Dashboard Description

Dashboard: Data Access Performance - Overview
Description: Monitor VDC data.

Dashboard: Data Access Performance - by Namespaces
Description: Monitor performance data for an individual namespace or a group of namespaces.

Dashboard: Data Access Performance - by Nodes
Description: See performance data for an individual node or a group of nodes in a VDC.

Dashboard: Data Access Performance - by Protocols
Description: See performance data for each supported protocol (S3, ATMOS, SWIFT) or a set of protocols.

Dashboard: Disk Bandwidth - by Nodes
Description: Monitor the disk usage metrics by read or write operations at the node level. The dashboard displays the latest values.

Dashboard: Disk Bandwidth - Overview
Description: Monitor the disk usage metrics by read or write operations at the VDC level.

Dashboard: Node Rebalancing
Description: Monitor the status of data rebalancing operations when nodes are added to, or removed from, a cluster. Node rebalancing is enabled by default at installation. Contact your technical support representative to disable or reenable this feature.

Dashboard: Process Health - by Nodes
Description: Monitor the use of the network interface, CPU, and available memory for each node of the VDC. The dashboard displays the latest values, and the history graphs display values in the selected range.

Dashboard: Process Health - Overview
Description: Monitor the VDC use of the network interface, CPU, and available memory. The dashboard displays the latest average values, and the history graphs display values in the selected time range.

Dashboard: Process Health - Process List by Node
Description: Monitor the CPU use, memory use, average thread count, and last restart time of processes in the selected time range. The dashboard displays the latest values in the selected time range.

Dashboard: Recovery Status
Description: Monitor the data recovered by the system.

Dashboard: SSD Read Cache
Description: Monitor the total SSD disk capacity and the disk space that is used by the SSD read cache.

Dashboard: Tech Refresh: Data Migration
Description: Monitor the data migration off and on a node or cluster.

Dashboard: Top Buckets
Description: Monitor the buckets with top utilization, based on total object size and count.

Table 26 Advanced monitoring dashboard fields

Dashboard Field Description

Field: Related dashboards
Dashboards: Data Access Performance - Overview; Data Access Performance - by Namespaces; Data Access Performance - by Nodes; Data Access Performance - by Protocols
Description: Allows you to switch to other dashboards in the access performance group, with the selected time.

Field: Transaction Summary
Dashboards: Data Access Performance - Overview; Data Access Performance - by Namespaces; Data Access Performance - by Nodes; Data Access Performance - by Protocols
Description: Lists the total Successful requests, System Failures, User Failures, and Failure % Rate for the selected VDCs, namespaces, nodes, or protocols.

Field: Performance Summary
Dashboards: Data Access Performance - Overview; Data Access Performance - by Nodes
Description: Lists the latest values of data access bandwidth and latency of read/write requests for the selected range.

Field: Successful requests
Dashboards: Data Access Performance - Overview; Data Access Performance - by Namespaces; Data Access Performance - by Nodes; Data Access Performance - by Protocols
Description: The number of data requests that were successfully completed.

Field: System Failures
Dashboards: Data Access Performance - Overview; Data Access Performance - by Namespaces; Data Access Performance - by Nodes; Data Access Performance - by Protocols
Description: The number of data requests that failed due to hardware or service errors (typically an HTTP error code of 5xx).

Field: User Failures
Dashboards: Data Access Performance - Overview; Data Access Performance - by Namespaces; Data Access Performance - by Nodes; Data Access Performance - by Protocols
Description: The number of data requests from all object heads that are classified as user failures. User failures are known error types originating from the object heads (typically an HTTP error code of 4xx).

Field: Failure % Rate
Dashboards: Data Access Performance - Overview; Data Access Performance - by Namespaces; Data Access Performance - by Nodes; Data Access Performance - by Protocols
Description: The percentage of failures for the VDC, namespace, nodes, or protocols.

Field: TPS (success/failure)
Dashboards: Data Access Performance - Overview; Data Access Performance - by Namespaces; Data Access Performance - by Nodes; Data Access Performance - by Protocols
Description: Rate of successful requests and failures per second.

Field: Bandwidth (read/write)
Dashboards: Data Access Performance - Overview; Data Access Performance - by Nodes; Data Access Performance - by Protocols
Description: Data access bandwidth of successful requests per second.

Field: Failed Requests/s by error type (user/system)
Dashboards: Data Access Performance - Overview; Data Access Performance - by Namespaces; Data Access Performance - by Nodes; Data Access Performance - by Protocols
Description: Rate of failed requests per second, split by error type (user/system).

Field: Latency
Dashboards: Data Access Performance - Overview; Data Access Performance - by Nodes; Data Access Performance - by Protocols; SSD Read Cache
Description: Latency of read/write requests.

Field: Successful request drill down
Dashboards: Data Access Performance - Overview; Data Access Performance - by Nodes
Description: Displays the rate of successful requests per second, by method, node, and protocol.

Field: Successful Requests/s by Method
Dashboards: Data Access Performance - Overview; Data Access Performance - by Nodes
Description: Rate of successful requests per second, by method.

Field: Successful Requests/s by Node
Dashboards: Data Access Performance - by Namespaces; Data Access Performance - by Nodes; Data Access Performance - by Protocols
Description: Rate of successful requests per second, by node.

Field: Successful Requests/s by Protocol
Dashboards: Data Access Performance - Overview; Data Access Performance - by Nodes
Description: Rate of successful requests per second, by protocol.

Field: Failures drill down
Dashboards: Data Access Performance - Overview; Data Access Performance - by Nodes
Description: Displays the rate of failed requests per second, by method, node, and protocol.

Field: Failed Requests/s by Method
Dashboards: Data Access Performance - Overview; Data Access Performance - by Nodes
Description: Rate of failed requests per second, by method.

Field: Failed Requests/s by Node
Dashboards: Data Access Performance - by Namespaces; Data Access Performance - by Nodes; Data Access Performance - by Protocols
Description: Rate of failed requests per second, by node.

Field: Failed Requests/s by Protocol
Dashboards: Data Access Performance - Overview; Data Access Performance - by Nodes
Description: Rate of failed requests per second, by protocol.

Field: Failed Requests/s by error code
Dashboards: Data Access Performance - Overview; Data Access Performance - by Nodes
Description: Rate of failed requests per second, by error code.

Field: Compare TPS of successful requests
Dashboards: Data Access Performance - by Nodes; Data Access Performance - by Namespaces; Data Access Performance - by Protocols
Description: Select multiple nodes and compare rates of successful requests per second.

Field: Compare TPS of failed requests
Dashboards: Data Access Performance - by Namespaces
Description: Select multiple nodes and compare rates of failed requests per second, by error type (user/system).

Field: Compare read bandwidth
Dashboards: Data Access Performance - by Nodes; Data Access Performance - by Protocols
Description: Select multiple nodes and compare data access bandwidth (read) of successful requests per second.

Field: Compare write bandwidth
Dashboards: Data Access Performance - by Nodes; Data Access Performance - by Protocols
Description: Select multiple nodes and compare data access bandwidth (write) of successful requests per second.

Field: Compare read latency
Dashboards: Data Access Performance - by Nodes; Data Access Performance - by Protocols
Description: Select multiple nodes and compare latency of read requests.

Field: Compare write latency
Dashboards: Data Access Performance - by Nodes; Data Access Performance - by Protocols
Description: Select multiple nodes and compare latency of write requests.

Field: Compare rate of failed requests/s
Dashboards: Data Access Performance - by Nodes; Data Access Performance - by Protocols
Description: Select multiple nodes and compare rates of failed requests per second, split by error type (user/system).

Field: Request drill down by nodes
Dashboards: Data Access Performance - by Namespaces
Description: Rate of requests per second, split by node.

Field: Read or Write
Dashboards: Disk Bandwidth - by Nodes; Disk Bandwidth - Overview
Description: Indicates whether the row describes read data or write data.

Field: Nodes
Dashboards: Disk Bandwidth - by Nodes; Disk Bandwidth - Overview
Description: The number of nodes in the VDC. You can click the nodes number to see the disk bandwidth metrics for each node. There is no Nodes column when you have drilled down into the Nodes display for a VDC.

Field: Total
Dashboards: Disk Bandwidth - by Nodes; Disk Bandwidth - Overview
Description: Total disk bandwidth that is used for either read or write operations.

Field: Hardware Recovery
Dashboards: Disk Bandwidth - by Nodes; Disk Bandwidth - Overview
Description: Rate at which disk bandwidth is used to recover data after a hardware failure.

Field: Erasure Encoding
Dashboards: Disk Bandwidth - by Nodes; Disk Bandwidth - Overview
Description: Rate at which disk bandwidth is used in system erasure coding operations.

Field: XOR
Dashboards: Disk Bandwidth - by Nodes; Disk Bandwidth - Overview
Description: Rate at which disk bandwidth is used in the XOR data protection operations of the system. XOR operations occur for systems with three or more sites (VDCs).

Field: Consistency Checker
Dashboards: Disk Bandwidth - by Nodes; Disk Bandwidth - Overview
Description: Rate at which disk bandwidth is used to check for inconsistencies between protected data and its replicas.

Field: Geo
Dashboards: Disk Bandwidth - by Nodes; Disk Bandwidth - Overview
Description: Rate at which disk bandwidth is used to support geo replication operations.

Field: User Traffic
Dashboards: Disk Bandwidth - by Nodes; Disk Bandwidth - Overview
Description: Rate at which disk bandwidth is used by object users.

Field: Data Rebalanced
Dashboards: Node Rebalancing
Description: Amount of data that has been rebalanced.

Field: Pending Rebalancing
Dashboards: Node Rebalancing
Description: Amount of data that is in the rebalance queue but has not been rebalanced yet.


Field: Rate of Rebalance (per day)
Dashboards: Node Rebalancing
Description: The incremental amount of data that was rebalanced during a specific time period. The default time period is one day.

Field: Process Restarts
Dashboards: Process Health - Process List by Node
Description: The last time the process restarted on the node in the selected time range. The maximum time range could be 5 days because it is limited by the retention policy.

Field: Avg. NIC Bandwidth
Dashboards: Process Health - Overview
Description: Average bandwidth of the network interface controller hardware that is used by the selected VDC or node.

Field: NIC Bandwidth
Dashboards: Process Health - Process List by Node
Description: Bandwidth of the network interface controller hardware that is used by the selected VDC or node.

Field: Avg. CPU Usage
Dashboards: Process Health - Overview
Description: Average percentage of the CPU hardware that is used by the selected VDC or node.

Field: Avg. Memory Usage
Dashboards: Process Health - Overview
Description: Average usage of the aggregate memory available to the VDC or node.

Field: Relative NIC (%)
Dashboards: Process Health - by Nodes; Process Health - Overview
Description: Percentage of the available bandwidth of the network interface controller hardware that is used by the selected VDC or node.

Field: Relative Memory (%)
Dashboards: Process Health - by Nodes; Process Health - Overview; Process Health - Process List by Node
Description: Percentage of the memory used relative to the memory available to the selected VDC or node.

Field: CPU Usage
Dashboards: Process Health - by Nodes; Process Health - Process List by Node
Description: Percentage of the node's CPU used by the process. The list of processes that are tracked is not the complete list of processes running on the node. The sum of the CPU used by the processes is not equal to the CPU usage shown for the node.

Field: Memory Usage
Dashboards: Process Health - by Nodes
Description: The memory used by the process.

Field: Relative Memory (%)
Dashboards: Process Health - by Nodes; Process Health - Overview; Process Health - Process List by Node
Description: Percentage of the memory used relative to the memory available to the process.

Field: Avg. # Thread
Dashboards: Process Health - Process List by Node
Description: Average number of threads used by the process.


Field: Last Restart
Dashboards: Process Health - Process List by Node
Description: The last time the process restarted on the node.

Field: Host
Dashboards: Process Health - by Nodes

Field: Process
Dashboards: Process Health - Process List by Node

Field: Amount of Data to be Recovered
Dashboards: Recovery Status
Description: With the Current filter selected, this is the logical size of the data yet to be recovered.
• When a historical period is selected as the filter, the meaning of Total Amount Data to be Recovered is the average amount of data pending recovery during the selected time.
• For example, if the first hourly snapshot of the data showed 400 GB of data to be recovered in a historical time period and every other snapshot showed 0 GB waiting to be recovered, the value of this field would be 400 GB divided by the total number of hourly snapshots in the period.

Field: Disk Usage
Dashboards: SSD Read Cache
Description: Used SSD space by Read Cache

Field: Disk Capacity
Dashboards: SSD Read Cache
Description: Total SSD disk capacity

Field: Remaining Volume to Migrate
Dashboards: Tech Refresh: Data Migration
Description: This panel shows graph of remaining volume on source nodes.

Field: Migration Speed
Dashboards: Tech Refresh: Data Migration
Description: This panel shows graph of remaining volume on source nodes.

Field: Data Migration Status
Dashboards: Tech Refresh: Data Migration
Description: Detailed status of migration on source nodes. Migration speed and predictions are calculated based on last 1 hour of currently selected time interval.

Field: Top Buckets by Size
Dashboards: Top buckets
Description: Top used buckets by size.

Field: Top Buckets by Object Count
Dashboards: Top buckets
Description: Top used buckets by object count.

Field: Time of Calculation
Dashboards: Top buckets
Description: The time at which the displayed metrics of Top Buckets dashboard were calculated.


View mode
Procedure
1. To view a dashboard in the view mode, click the title of a dashboard, for example TPS
(success/failure), and select View.
The dashboard opens in the view mode or in the full-screen mode.
2. Click the Back to dashboard icon to return to the dashboards view.

Export CSV
Procedure
1. To export the dashboard data to .csv format, click the title of a dashboard, for example TPS
(success/failure), and select More > Export CSV.
The Export CSV window opens.

You can customize the CSV output by modifying the Mode and Date Time Format attributes and by
selecting or clearing the Excel CSV Dialect attribute.
2. Click Export > Save to export the dashboard data in .csv format to your local storage.
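
Once a panel has been exported, the file can be processed with any CSV reader. The following is a minimal sketch that uses Python's standard csv module. The column names in the sample (Time, success, failure) are hypothetical; the actual header depends on the panel and on the Mode, Date Time Format, and Excel CSV Dialect settings chosen in the Export CSV window, so inspect the exported header first.

```python
import csv
import io


def load_series(csv_text, delimiter=","):
    """Parse exported CSV text and return (column names, list of row dicts).

    The delimiter is a parameter because the exported dialect may differ
    from plain comma-separated output.
    """
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    rows = list(reader)
    return reader.fieldnames, rows


# Hypothetical sample; real exports will have panel-specific columns.
SAMPLE = (
    "Time,success,failure\n"
    "2020-05-01 10:00:00,120,3\n"
    "2020-05-01 10:01:00,118,0\n"
)

if __name__ == "__main__":
    columns, rows = load_series(SAMPLE)
    print(columns)
    print(sum(int(row["failure"]) for row in rows), "failed requests in sample")
```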

Data Access Performance - Overview


The Data Access Performance - Overview dashboard is the default dashboard.
In the Data Access Performance - Overview dashboard, you can monitor the following for all nodes
in the VDC:
• TPS (success/failure)
• Bandwidth (read/write)
• Failed Requests/s by error type (user/system)
• Latency
• Successful Requests/s by Method
• Successful Requests/s by Protocol
• Failed Requests/s by Method
• Failed Requests/s by Protocol
• Failed Requests/s by error code
To view the Data Access Performance - Overview dashboard in the ECS Portal, select Advanced
Monitoring.
Click Successful requests drill down to see the successful requests by method, node, and
protocol.
Click Failures drill down to see the failed requests by method, node, protocol, and error
code.
Click Related dashboards to view the other dashboards for the selected time.
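
Table 26 describes system failures as typically HTTP 5xx responses and user failures as typically HTTP 4xx responses. The sketch below shows how a list of response codes could be turned into the Transaction Summary counts. It is an illustration only: the function names are hypothetical, every other status code is assumed to count as a success, and the Failure % Rate is assumed to be failures divided by all requests, which the guide does not spell out.

```python
def classify(status):
    """Classify one HTTP status code using the typical mapping from Table 26."""
    if 500 <= status <= 599:
        return "system_failure"
    if 400 <= status <= 499:
        return "user_failure"
    return "success"  # assumption: all other codes count as successful requests


def summarize(statuses):
    """Count successes and failures and compute the failure percentage.

    Assumption: Failure % Rate = 100 * (user + system failures) / all requests.
    """
    counts = {"success": 0, "user_failure": 0, "system_failure": 0}
    for status in statuses:
        counts[classify(status)] += 1
    total = len(statuses)
    failures = counts["user_failure"] + counts["system_failure"]
    counts["failure_rate_pct"] = 100.0 * failures / total if total else 0.0
    return counts


if __name__ == "__main__":
    print(summarize([200, 200, 404, 500]))
```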

Data Access Performance - by Namespaces


In the Data Access Performance - by Namespaces dashboard, you can monitor the following for
namespaces:
• TPS (success/failure)
• Failed Requests/s by error type (user/system)
• Successful Requests/s by Node
• Failed Requests/s by Node
• Compare TPS of successful requests
• Compare TPS of failed requests
To view the Data Access Performance - by Namespaces dashboard in the ECS Portal, select
Advanced Monitoring > Related dashboards > Data Access Performance - by Namespaces.
Data for all the namespaces is visible in the default view. To select a namespace, click the legend
parameter for the namespace below the graph.
Requests drill down by nodes shows the successful and failed requests by node.
Compare: select multiple namespaces compares the TPS of successful and failed requests.

Data Access Performance - by Nodes


In the Data Access Performance - by Nodes dashboard, you can monitor the following for nodes in a
VDC:
• TPS (success/failure)
• Bandwidth (read/write)
• Failed Requests/s by error type (user/system)
• Latency
• Successful Requests/s by Method
• Successful Requests/s by Node
• Successful Requests/s by Protocol
• Failed Requests/s by Method
• Failed Requests/s by Node
• Failed Requests/s by Protocol
• Failed Requests/s by error code
• Compare TPS of successful requests
• Compare TPS of failed requests
• Compare read bandwidth
• Compare write bandwidth
• Compare read latency
• Compare write latency
To view the Data Access Performance - by Nodes dashboard in the ECS Portal, select Advanced
Monitoring > Related dashboards > Data Access Performance - by Nodes.
Data for all the nodes is visible in the default view. To select data for a node, click the legend
parameter for the node below the graph.
Successful requests drill down shows the successful requests by method, node, and protocol.
Failures drill down shows the failed requests by method, node, protocol, and error code.
Compare: select multiple namespaces compares the TPS of successful and failed requests, the
read/write bandwidth, and the read/write latency.

Data Access Performance - by Protocols


In the Data Access Performance - by Protocols dashboard, you can monitor the following for each
protocol:
• TPS (success/failure)
• Bandwidth (read/write)
• Failed Requests/s by error type (user/system)
• Latency
• Successful Requests/s by Node
• Failed Requests/s by Node
• Compare TPS of successful requests
• Compare TPS of failed requests
• Compare read bandwidth
• Compare write bandwidth
• Compare read latency
• Compare write latency
To view the Data Access Performance - by Protocols dashboard in the ECS Portal, select Advanced
Monitoring > Related dashboards > Data Access Performance - by Protocols.
Data for all the protocols is visible in the default view. To select data for a protocol, click the
legend parameter for the protocol below the graph.
Requests drill down by nodes shows the successful and failed requests by node.
Compare: select multiple namespaces compares the TPS of successful and failed requests, the
read/write bandwidth, and the read/write latency.

Disk Bandwidth - by Nodes


You can use the Disk Bandwidth - by Nodes dashboard to monitor the disk usage metrics by read
or write operations at the node level. The dashboard displays the latest values.
To view the Disk Bandwidth - by Nodes dashboard, click Advanced Monitoring, expand Data
Access Performance - Overview, and select Disk Bandwidth - by Nodes.

Disk Bandwidth - Overview


You can use the Disk Bandwidth - Overview dashboard to monitor the disk usage metrics by read
or write operations at the VDC level.
To view the Disk Bandwidth - Overview dashboard, click Advanced Monitoring, expand Data
Access Performance - Overview, and select Disk Bandwidth - Overview.

Node Rebalancing
You can use the Node Rebalancing dashboard to monitor the status of data rebalancing
operations when nodes are added to, or removed from, a cluster. Node rebalancing is enabled by
default at installation. Contact your customer support representative to disable or re-enable this
feature.
To view the Node Rebalancing dashboard, click Advanced Monitoring, expand Data Access
Performance - Overview, and select Node Rebalancing.
A series of interactive graphs shows the amount of data rebalanced, the amount of data pending
rebalancing, and the rate of rebalancing, in bytes over time.
Node rebalancing works only for new nodes that are added to the cluster.
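
Table 26 defines Rate of Rebalance as the incremental amount of data rebalanced during a time period, one day by default. Given a series of cumulative Data Rebalanced readings taken once per period, the increments can be derived as shown in this sketch. The function name and the sample values are hypothetical, and the dashboard may compute the rate differently.

```python
def rebalance_increments(cumulative_bytes):
    """Return the amount rebalanced in each period from cumulative readings.

    `cumulative_bytes` is assumed to hold one Data Rebalanced reading per
    period (per day by default), ordered from oldest to newest.
    """
    return [later - earlier
            for earlier, later in zip(cumulative_bytes, cumulative_bytes[1:])]


if __name__ == "__main__":
    # Hypothetical daily readings in bytes.
    readings = [0, 500, 1200, 1200]
    print(rebalance_increments(readings))
```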


Process Health - by Nodes


You can use the Process Health - by Nodes dashboard to monitor the use of the network
interface, CPU, and available memory for each node of the VDC. The dashboard displays the latest
values, and the history graphs display values in the selected range.
To view the Process Health - by Nodes dashboard, click Advanced Monitoring, expand Data
Access Performance - Overview, and select Process Health - by Nodes.

Process Health - Overview


You can use the Process Health - Overview dashboard to monitor the VDC use of the network
interface, CPU, and available memory. The dashboard displays the latest average values, and the
history graphs display values in the selected time range.
To view the Process Health - Overview dashboard, click Advanced Monitoring, expand Data
Access Performance - Overview, and select Process Health - Overview.
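
The Relative NIC (%) and Relative Memory (%) fields in Table 26 are described as the amount used relative to the amount available. A hedged sketch of that ratio follows; the function name is hypothetical, and the dashboard's exact calculation, rounding, and sampling are not documented in this guide.

```python
def relative_percent(used, available):
    """Return `used` as a percentage of `available`.

    Raises ValueError when `available` is not positive, because the ratio
    is undefined in that case.
    """
    if available <= 0:
        raise ValueError("available must be positive")
    return 100.0 * used / available


if __name__ == "__main__":
    # Hypothetical values: 48 GiB used out of 64 GiB available.
    print(relative_percent(48, 64), "% of memory used")
```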

Process Health - Process List by Node


You can use the Process Health - Process List by Node dashboard to monitor the CPU use,
memory use, average thread count, and last restart time of processes in the selected time range.
The dashboard displays the latest values in the selected time range.
To view the Process Health - Process List by Node dashboard, click Advanced Monitoring,
expand Data Access Performance - Overview, and select Process Health - Process List by Node.

Recovery Status
You can use the Recovery Status dashboard to see:
• The latest value of the logical size of the data yet to be recovered in the selected time range
• The history of the amount of data that is pending recovery in the selected time range
To view the Recovery Status dashboard, click Advanced Monitoring, expand Data Access
Performance - Overview, and select Recovery Status.

SSD Read Cache


ECS has been upgraded to support SSD read caching. Each node has a single SSD read cache drive.
The SSD read cache feature is implemented on ECS Gen2 U-Series and Gen3.
If a VDC has a mixed hardware configuration in which some nodes cannot support SSD read cache,
the SSD read cache feature is not supported in that VDC.
You can use the SSD Read Cache dashboard to monitor total SSD disk capacity and the disk space
that is used by SSD read cache.
Note: Nodes that do not have SSD disks are also listed in the node selection dropdown,
but their values are 0 because they have no SSD disks.
To view the SSD Read Cache dashboard, click Advanced Monitoring > expand Data Access
Performance - Overview > SSD Read Cache.
See ECS Solve Online for details.


Tech Refresh: Data Migration


You can use the Tech Refresh: Data Migration dashboard to monitor data migration on and off
a node or cluster.
To view the Tech Refresh: Data Migration dashboard, click Advanced Monitoring > expand Data
Access Performance - Overview > Tech Refresh: Data Migration.

Top Buckets
ECS metering includes a mechanism that identifies the buckets with top utilization, based on
total object size and object count.
Statistics for the buckets with top utilization in the system are displayed in the monitoring
dashboards. The number of buckets that are displayed on the monitoring dashboard is configurable.
To view the Top buckets dashboard, click Advanced Monitoring > expand Data Access
Performance - Overview > Top buckets.

Automatic Metering Reconstruction


Automatic metering reconstruction is a mechanism that rebuilds the metering statistics
completely.
Metering stores utilization statistics by namespace and bucket, based on object size and count.
When an object is created in a bucket, the statistics are reported to the metering service,
where they are aggregated and stored. Statistics are aggregated and mapped to a time that is
rounded down to a multiple of five minutes. For example, objects that are created at
10:04:59 pm are mapped to 10:00:00 pm. The metering statistics are stored in time series
format to provide a historical view of the statistics and to serve billing sample queries.
The statistics are displayed in a time window.
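The mapping in the 10:04:59 pm example can be sketched as follows. This is an illustration only, not the metering service's implementation: the function name metering_bucket is hypothetical, and it assumes that a timestamp is mapped to the start of the five-minute interval that contains it, as the example suggests.

```python
from datetime import datetime, timedelta

# Aggregation interval described in the text (five minutes).
BUCKET = timedelta(minutes=5)


def metering_bucket(ts: datetime) -> datetime:
    """Return the start of the five-minute interval that contains ts.

    Hypothetical helper: assumes timestamps are rounded down, which is
    what the 10:04:59 pm -> 10:00:00 pm example implies.
    """
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    elapsed = ts - midnight
    # timedelta // timedelta yields the number of whole intervals elapsed.
    return midnight + (elapsed // BUCKET) * BUCKET


created = datetime(2020, 3, 10, 22, 4, 59)  # 10:04:59 pm
print(metering_bucket(created))  # 2020-03-10 22:00:00
```

An object created exactly on an interval boundary, for example at 10:05:00 pm, stays in the interval that starts at that boundary.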
As a result of logic errors in the implementation of metering or in blob service side operations,
wrong statistics can be reported to metering. Incorrect metering information is compounded and
remains inaccurate from that point forward. Automatic metering reconstruction is a mechanism to
overcome the problem of erroneous statistics.
This feature is disabled in ECS 3.5.0.0. You must enable it manually.
Automatic reconstruction is invoked in the following scenarios:
• During upgrade
• When the system recovers from a PSO

Share Advanced Monitoring Dashboards


The Share dashboard icon enables you to create a direct link to the dashboard or panel, share a
snapshot of an interactive dashboard publicly, and export the dashboard to a JSON file.
For procedures on sharing the dashboard link, dashboard snapshot, and dashboard as a JSON file,
refer to the Grafana documentation.

Flux API
The Flux API enables you to retrieve time series database data by sending REST queries using
curl. You can get raw data from the fluxd service in a way similar to using the Dashboard API.
You must first get a token, and then provide the token in each request.
Before you begin
Requires one of the following roles:

• SYSTEM_ADMIN
• SYSTEM_MONITOR
Request payload examples

application/json - JSON format

{
"query": "from(bucket:\"monitoring_main\") |> range(start: -30m) |> filter(fn: (r) => r._measurement == \"statDataHead_performance_internal_transactions\")"
}

application/vnd.flux - CSV format

query=from(bucket: "monitoring_main")
|> range(start: -30m)
|> filter(fn: (r) => r._measurement == "statDataHead_performance_internal_transactions")

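The JSON payload embeds the Flux query as a string, so every double quote inside the query must be escaped and the string must stay on one line. The following sketch builds the payload programmatically so that the escaping is handled by a JSON library. The helper names flux_query and json_payload are hypothetical and are not part of ECS.

```python
import json

MEASUREMENT = "statDataHead_performance_internal_transactions"


def flux_query(bucket: str, measurement: str, start: str = "-30m") -> str:
    """Build the Flux query text used in the payload examples."""
    return (
        f'from(bucket:"{bucket}") |> range(start: {start}) '
        f'|> filter(fn: (r) => r._measurement == "{measurement}")'
    )


def json_payload(query: str) -> str:
    """Wrap a Flux query in the JSON request body; json.dumps escapes the quotes."""
    return json.dumps({"query": query})


query = flux_query("monitoring_main", MEASUREMENT)
print(json_payload(query))  # body for content-type application/json
print(query)                # raw Flux text for content-type application/vnd.flux
```

Generating the body this way avoids hand-escaped quotes and line breaks inside the JSON string, both of which are easy to get wrong when the query is typed at a shell prompt.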
Procedure
1. Generate a token.

Token

admin@ecs:/> tok=$(curl -iks https://localhost:4443/login -u emcmonitor:#### | grep X-SDS-AUTH-TOKEN)
admin@ecs:/> echo $tok
X-SDS-AUTH-TOKEN:****

#### represents a password.
**** represents an X-SDS-AUTH-TOKEN value.


2. Run the query.
The curl arguments vary depending on the output format (JSON or CSV). See the examples for
details.

JSON example

admin@ecs:/> curl https://localhost:4443/flux/api/external/v2/query -XPOST -k -sS \
-H "$tok" -H 'accept:application/json' -H 'content-type:application/json' \
-d '{"query": "from(bucket:\"monitoring_main\") |> range(start: -30m) |> filter(fn: (r) => r._measurement == \"statDataHead_performance_internal_transactions\")"}'
{
"Series": [
{
"Datatypes": [
"long",
"dateTime:RFC3339",
"dateTime:RFC3339",
"dateTime:RFC3339",
"long",

"string",
"string",
"string",
"string",
"string",
"string"
],
"Columns": [
"table",
"_start",
"_stop",
"_time",
"_value",
"_field",
"_measurement",
"host",
"node_id",
"process",
"tag"
],
"Values": [
[
"0",
"2020-03-10T09:54:31.207799855Z",
"2020-03-10T10:24:31.207799855Z",
"2020-03-10T09:56:43Z",
"1",
"failed_request_counter",
"statDataHead_performance_internal_transactions",
"ecs.lss.emc.com",
"28cd473e-ca45-4623-b30d-0481c548a650",
"statDataHead",
"dashboard"
],
[
"0",
"2020-03-10T09:54:31.207799855Z",
"2020-03-10T10:24:31.207799855Z",
"2020-03-10T10:01:43Z",
"1",
"failed_request_counter",
"statDataHead_performance_internal_transactions",
"ecs.lss.emc.com",
"28cd473e-ca45-4623-b30d-0481c548a650",
"statDataHead",
"dashboard"
],
[
"0",
"2020-03-10T09:54:31.207799855Z",
"2020-03-10T10:24:31.207799855Z",
"2020-03-10T10:06:43Z",
"1",
"failed_request_counter",
"statDataHead_performance_internal_transactions",
"ecs.lss.emc.com",
"28cd473e-ca45-4623-b30d-0481c548a650",
"statDataHead",
"dashboard"
],
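In the JSON output, each entry in "Series" carries parallel "Datatypes" and "Columns" lists, and each entry in "Values" is one row in the same column order. A minimal sketch of turning such a response into one dictionary per row follows. The function name series_to_rows is hypothetical, the sample is a shortened version of the output above, and converting only the "long" type to an integer is an assumption made for illustration.

```python
import json

# Converters keyed by the "Datatypes" entries seen in the sample response.
# Assumption: only "long" is converted; every other type is kept as a string.
CONVERTERS = {"long": int}


def series_to_rows(response_text):
    """Flatten a Series/Columns/Values response into a list of dicts."""
    rows = []
    for series in json.loads(response_text).get("Series", []):
        columns = series["Columns"]
        datatypes = series["Datatypes"]
        for values in series["Values"]:
            rows.append({
                column: CONVERTERS.get(datatype, str)(value)
                for column, datatype, value in zip(columns, datatypes, values)
            })
    return rows


# Shortened sample with four of the columns shown in the output above.
sample = json.dumps({
    "Series": [{
        "Datatypes": ["long", "dateTime:RFC3339", "long", "string"],
        "Columns": ["table", "_time", "_value", "_field"],
        "Values": [
            ["0", "2020-03-10T09:56:43Z", "1", "failed_request_counter"],
            ["0", "2020-03-10T10:01:43Z", "1", "failed_request_counter"],
        ],
    }]
})

for row in series_to_rows(sample):
    print(row["_time"], row["_field"], row["_value"])
```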

CSV example

admin@ecs:/> curl https://localhost:4443/flux/api/external/v2/query -XPOST -k -sS \
-H "$tok" -H 'accept:application/csv' -H 'content-type:application/vnd.flux' \
-d 'from(bucket:"monitoring_main") |> range(start:-30m) |> filter(fn: (r) => r._measurement == "statDataHead_performance_internal_transactions")'
#datatype,string,long,dateTime:RFC3339,dateTime:RFC3339,dateTime:RFC3339,long,string,string,string,string,string,string
#group,false,false,true,true,false,false,true,true,true,true,true,true
#default,_result,,,,,,,,,,,
,result,table,_start,_stop,_time,_value,_field,_measurement,host,node_id,process,tag
,,0,2020-03-10T09:58:59.049910533Z,2020-03-10T10:28:59.049910533Z,2020-03-10T10:01:43Z,1,failed_request_counter,statDataHead_performance_internal_transactions,ecs.lss.emc.com,28cd473e-ca45-4623-b30d-0481c548a650,statDataHead,dashboard
,,0,2020-03-10T09:58:59.049910533Z,2020-03-10T10:28:59.049910533Z,2020-03-10T10:06:43Z,1,failed_request_counter,statDataHead_performance_internal_transactions,ecs.lss.emc.com,28cd473e-ca45-4623-b30d-0481c548a650,statDataHead,dashboard
,,0,2020-03-10T09:58:59.049910533Z,2020-03-10T10:28:59.049910533Z,2020-03-10T10:11:43Z,1,failed_request_counter,statDataHead_performance_internal_transactions,ecs.lss.emc.com,28cd473e-ca45-4623-b30d-0481c548a650,statDataHead,dashboard
,,0,2020-03-10T09:58:59.049910533Z,2020-03-10T10:28:59.049910533Z,2020-03-10T10:16:43Z,1,failed_request_counter,statDataHead_performance_internal_transactions,ecs.lss.emc.com,28cd473e-ca45-4623-b30d-0481c548a650,statDataHead,dashboard
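The CSV output starts with annotation rows (#datatype, #group, #default), followed by a header row and data rows. Each of these rows has an empty first column. The following sketch reads such output into one dictionary per data row. The function name parse_annotated_csv is hypothetical, and the handling of blank lines as table separators is an assumption. Values are left as strings.

```python
import csv
import io


def parse_annotated_csv(text):
    """Parse Flux-style annotated CSV into a list of dicts (values stay strings)."""
    rows, header = [], None
    for record in csv.reader(io.StringIO(text)):
        if not record:
            # Assumption: a blank line ends the current table.
            header = None
        elif record[0].startswith("#"):
            # Annotation row (#datatype, #group, #default): skipped here.
            continue
        elif header is None:
            # First non-annotation row is the header; drop the empty first column.
            header = record[1:]
        else:
            rows.append(dict(zip(header, record[1:])))
    return rows


SAMPLE = (
    "#datatype,string,long,dateTime:RFC3339,dateTime:RFC3339,dateTime:RFC3339,"
    "long,string,string,string,string,string,string\n"
    "#group,false,false,true,true,false,false,true,true,true,true,true,true\n"
    "#default,_result,,,,,,,,,,,\n"
    ",result,table,_start,_stop,_time,_value,_field,_measurement,host,node_id,process,tag\n"
    ",,0,2020-03-10T09:58:59.049910533Z,2020-03-10T10:28:59.049910533Z,"
    "2020-03-10T10:01:43Z,1,failed_request_counter,"
    "statDataHead_performance_internal_transactions,ecs.lss.emc.com,"
    "28cd473e-ca45-4623-b30d-0481c548a650,statDataHead,dashboard\n"
)

for row in parse_annotated_csv(SAMPLE):
    print(row["_time"], row["_field"], row["_value"])
```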

Monitoring list of metrics


The following tags have common values across all measurements:
• host - name of data node
• node_id - ID of data node
• tag - internal, set to dashboard

Monitoring list of metrics: Non-Performance


Database monitoring_main
Metrics in this database are raw. Each is split by data node, that is, all have host and
node_id tags.
Data for ECS Service I/O Statistics

Information:
Measurements in this section have the following structure:

service_IO_Statistics_data_read - for read I/O counters

service_IO_Statistics_data_write - for write I/O counters

Service is the name of the ECS service that produces the measurement, for example, blob,
cm, georcv, or statDataHead.

For example,

blob_IO_Statistics_data_read
cm_IO_Statistics_data_write
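The naming structure above can be expressed as a small helper that composes a measurement name from a service name and an I/O direction. This is a sketch for building query strings; the function name io_measurement is hypothetical, and the service names are the ones listed in this section.

```python
# Service names listed in this section as producers of I/O statistics.
SERVICES = ("blob", "cm", "georcv", "statDataHead")


def io_measurement(service: str, direction: str) -> str:
    """Compose an I/O statistics measurement name for a service.

    direction must be "read" or "write".
    """
    if direction not in ("read", "write"):
        raise ValueError("direction must be 'read' or 'write'")
    return f"{service}_IO_Statistics_data_{direction}"


for service in SERVICES:
    print(io_measurement(service, "read"), io_measurement(service, "write"))
```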

Measurement: blob_IO_Statistics_data_read
...
Tags: host, node_id, process, tag
Fields: read_CCTotal (float, bytes)
read_ECTotal (float, bytes)
read_GEOTotal (float, bytes)
read_RECOVERTotal (float, bytes)
read_USERTotal (float, bytes)
read_XORTotal (float, bytes)

Measurement: blob_IO_Statistics_data_write
...
Tags: host, node_id, process, tag
Fields: write_CCTotal (integer)
write_ECTotal (integer)
write_GEOTotal (integer)
write_RECOVERTotal (integer)
write_USERTotal (integer)
write_XORTotal (integer)

Data for SSD Read cache

Measurement: blob_SSDReadCache_Stats
Tags: host, id, last, node_id, process
Fields: +Inf (integer)
0.0 (integer)
1000.0 (integer)
25000.0 (integer)
5000.0 (integer)
rocksdb_disk_capacity_failure_counter (integer)
rocksdb_disk_usage_counter_bytes (integer)
rocksdb_disk_usage_percentage_counter (integer)
ssd_capacity_counter_bytes (integer)

CM statistics
These statistics represent processes in the ECS service CM, such as BTree GC, chunk management,
and erasure coding.

Measurement: cm_BTREE_GC_Statistics
Tags: host, node_id, process, tag
Fields: accumulated_candidate_garbage_btree_gc_level_0 (integer)
accumulated_candidate_garbage_btree_gc_level_1 (integer)
accumulated_detected_data_btree_level_0 (integer)
accumulated_detected_data_btree_level_1 (integer)
accumulated_reclaimed_data_btree_level_0 (integer)
accumulated_reclaimed_data_btree_level_1 (integer)
candidate_chunks_btree_gc_level_0 (integer)
candidate_chunks_btree_gc_level_1 (integer)
candidate_garbage_btree_gc_level_0 (integer)
candidate_garbage_btree_gc_level_1 (integer)
copy_candidate_chunks_btree_gc_level_0 (integer)
copy_candidate_chunks_btree_gc_level_1 (integer)
copy_completed_chunks_btree_gc_level_0 (integer)
copy_completed_chunks_btree_gc_level_1 (integer)
copy_waiting_chunks_btree_gc_level_0 (integer)
copy_waiting_chunks_btree_gc_level_1 (integer)
deleted_chunks_btree_level_0 (integer)
deleted_chunks_btree_level_1 (integer)
deleted_data_btree_level_0 (integer)
deleted_data_btree_level_1 (integer)
full_reclaimable_chunks_btree_gc_level_0 (integer)
full_reclaimable_chunks_btree_gc_level_1 (integer)
reclaimed_data_btree_level_0 (integer)
reclaimed_data_btree_level_1 (integer)
usage_between_0%_and_5%_chunks_btree_gc_level_0 (integer)
usage_between_0%_and_5%_chunks_btree_gc_level_1 (integer)
usage_between_10%_and_15%_chunks_btree_gc_level_0 (integer)
usage_between_10%_and_15%_chunks_btree_gc_level_1 (integer)
usage_between_5%_and_10%_chunks_btree_gc_level_0 (integer)
usage_between_5%_and_10%_chunks_btree_gc_level_1 (integer)
verification_waiting_chunks_btree_gc_level_0 (integer)
verification_waiting_chunks_btree_gc_level_1 (integer)

Measurement: cm_Chunk_Statistics
Tags: host, node_id, process, tag
Fields: chunks_copy (integer)
chunks_copy_active (integer)
chunks_copy_s0 (integer)
chunks_level_0_btree (integer)
chunks_level_0_btree_active (integer)
chunks_level_0_btree_active_index_page (integer)
chunks_level_0_btree_active_leaf_page (integer)
chunks_level_0_btree_index_page (integer)
chunks_level_0_btree_leaf_page (integer)
chunks_level_0_btree_s0 (integer)
chunks_level_0_btree_s0_index_page (integer)
chunks_level_0_btree_s0_leaf_page (integer)
chunks_level_0_journal (integer)
chunks_level_0_journal_active (integer)
chunks_level_0_journal_s0 (integer)
chunks_level_1_btree (integer)
chunks_level_1_btree_active (integer)
chunks_level_1_btree_active_index_page (integer)
chunks_level_1_btree_active_leaf_page (integer)
chunks_level_1_btree_index_page (integer)
chunks_level_1_btree_leaf_page (integer)
chunks_level_1_btree_s0 (integer)
chunks_level_1_btree_s0_index_page (integer)
chunks_level_1_btree_s0_leaf_page (integer)
chunks_level_1_journal (integer)
chunks_level_1_journal_active (integer)
chunks_level_1_journal_s0 (integer)
chunks_repo (integer)
chunks_repo_active (integer)
chunks_repo_s0 (integer)
chunks_typeII_ec_pending (integer)
chunks_typeI_ec_pending (integer)
chunks_undertransform_ec_pending (integer)
chunks_xor (integer)
data_copy (integer)
data_level_0_btree (integer)
data_level_0_btree_index_page (integer)
data_level_0_btree_leaf_page (integer)
data_level_0_journal (integer)
data_level_1_btree (integer)
data_level_1_btree_index_page (integer)
data_level_1_btree_leaf_page (integer)
data_level_1_journal (integer)
data_repo (integer)
data_repo_copy (integer)
data_xor (integer)
data_xor_shipped (integer)

Measurement: cm_EC_Statistics
Tags: host, node_id, process, tag
Fields: chunks_ec_encoded (integer)
chunks_ec_encoded_alive (integer)
data_ec_encoded (integer)
data_ec_encoded_alive (integer)

Measurement: cm_Geo_Replication_Statistics_Geo_Chunk_Cache
Tags: host, node_id, process, tag
Fields: Capacity_of_Cache (integer)
Number_of_Chunks (integer)

Measurement: cm_REPO_GC_Statistics
Tags: host, node_id, process, tag
Fields: accumulated_deleted_garbage_repo (integer)
accumulated_reclaimed_garbage_repo (integer)
deleted_chunks_repo (integer)
deleted_data_repo (integer)
ec_freed_slots (integer)
full_reclaimable_aligned_chunk (integer)
merge_copy_overhead_in_deleted_data_repo (integer)
merge_copy_overhead_in_reclaimed_data_repo (integer)
reclaimed_chunk_repo (integer)
reclaimed_data_repo (integer)
slots_waiting_shipping (integer)
slots_waiting_verification (integer)
total_ec_free_slots (integer)

Measurement: cm_Rebalance_Statistics
Tags: host, node_id, process, tag
Fields: bytes_rebalanced (integer)
bytes_rebalancing_failed (integer)
chunks_canceled (integer)
chunks_for_rebalancing (integer)
chunks_rebalanced (integer)
chunks_total (integer)
jobs_canceled (integer)
segments_for_rebalancing (integer)
segments_rebalanced (integer)
segments_rebalancing_failed (integer)
segments_total (integer)

Measurement: cm_Rebalance_Statistics_CoS
Tags: CoS, host, node_id, process, tag
Fields: bytes_rebalanced (integer)
bytes_rebalancing_failed (integer)
chunks_canceled (integer)
chunks_for_rebalancing (integer)
chunks_rebalanced (integer)
chunks_total (integer)
jobs_canceled (integer)
segments_for_rebalancing (integer)
segments_rebalanced (integer)
segments_rebalancing_failed (integer)
segments_total (integer)

Measurement: cm_Recover_Statistics
Tags: host, node_id, process, tag
Fields: chunks_to_recover (integer)
data_recovered (integer)
data_to_recover (integer)

Measurement: cm_Recover_Statistics_CoS
Tags: CoS, host, node_id, process, tag
Fields: chunks_to_recover (integer)
data_recovered (integer)
data_to_recover (integer)

SR statistics
These statistics represent processes in ECS service SR, responsible for space reclamation.

Measurement: sr_REPO_GC_Statistics
Tags: host, node_id, process, tag
Fields: accumulated_merge_copy_overhead_in_full_garbage (integer)
accumulated_total_repo_garbage (integer)
full_reclaimable_repo_chunk (integer)
garbage_in_partial_sr_tasks (integer)
garbage_in_repo_usage (integer)
merge_copy_overhead_in_full_garbage (integer)
merge_way_gc_processed_chunks (integer)
merge_way_gc_src_chunks (integer)
merge_way_gc_targeted_chunks (integer)
merge_way_gc_tasks (integer)
total_repo_garbage (integer)
usage_between_0%_and_33.3%_repo_chunk (integer)
usage_between_33.3%_and_50%_repo_chunk (integer)
usage_between_50%_and_66.7%_repo_chunk (integer)

SSM statistics
These statistics represent processes in ECS storage manager service SSM.

Measurement: ssm_sstable_SSTable_SS
Tags: SS, SSTable, last, process, tag
Fields: allocatedSpace (integer)
availableFreeSpace (integer)
downDurationTotal (integer)
freeSpace (integer)
largeBlockAllocated (integer)
largeBlockAllocatedSize (integer)
largeBlockFreed (integer)
largeBlockFreedSize (integer)
pendingDurationTotal (integer)
pingerDurationTotal (integer)
smallBlockAllocated (integer)
smallBlockFreed (integer)
smallBlockFreedSize (integer)
smallBlockSize (integer)
state (string)
timeInStateTotal (integer)
totalSpace (integer)
upDurationTotal (integer)

Measurement: ssm_sstable_SSTable_SS_datamigration
Tags: SS, SSTable, last, process
Fields: status (integer)
totalCapacityToMigrate (integer)

Database monitoring_last
Service status, memory, and cache statistics

Measurement: blob_Process_status
Tags: Process_status, host, node_id, process
Fields: MemoryTableFreeSpacePercentagePerMinute (integer)
NumberofWritePageAllocationOutsideWriteCache (integer)

Measurement: blob_Total_memory_and_disk_cache_size
Tags: Total_memory_and_disk_cache_size, host, last, node_id, process
Fields: Disk_cache_size (integer)
Memory_cache_size (integer)

Measurement: cm_Process_status
Tags: Process_status, host, node_id, process
Fields: MemoryTableFreeSpacePercentagePerMinute (integer)
NumberofWritePageAllocationOutsideWriteCache (integer)

Measurement: eventsvc_Process_status
Tags: Process_status, host, node_id, process
Fields: MemoryTableFreeSpacePercentagePerMinute (integer)
NumberofWritePageAllocationOutsideWriteCache (integer)

Measurement: mm_Process_status
Tags: Process_status, host, node_id, process
Fields: MemoryTableFreeSpacePercentagePerMinute (integer)
NumberofWritePageAllocationOutsideWriteCache (integer)

Measurement: resource_Process_status
Tags: Process_status, host, node_id, process
Fields: MemoryTableFreeSpacePercentagePerMinute (integer)
NumberofWritePageAllocationOutsideWriteCache (integer)

Measurement: rm_Process_status
Tags: Process_status, host, node_id, process
Fields: MemoryTableFreeSpacePercentagePerMinute (integer)
NumberofWritePageAllocationOutsideWriteCache (integer)

Measurement: sr_Process_status
Tags: Process_status, host, node_id, process
Fields: MemoryTableFreeSpacePercentagePerMinute (integer)
NumberofWritePageAllocationOutsideWriteCache (integer)

Measurement: sr_Total_memory_and_disk_cache_size
Tags: Total_memory_and_disk_cache_size, host, last, node_id, process
Fields: Disk_cache_size (integer)
Memory_cache_size (integer)

Measurement: ssm_Process_status
Tags: Process_status, host, node_id, process
Fields: MemoryTableFreeSpacePercentagePerMinute (integer)
NumberofWritePageAllocationOutsideWriteCache (integer)

Export of configuration framework values

Measurement: dtquery_cmf
Tags: last, process
Fields: com.emc.ecs.chunk.gc.btree.enabled (integer)
com.emc.ecs.chunk.gc.btree.scanner.verification.enabled (integer)
com.emc.ecs.chunk.gc.repo.enabled (integer)
com.emc.ecs.chunk.gc.repo.verification.enabled (integer)
com.emc.ecs.chunk.rebalance.is_enabled (integer)
com.emc.ecs.objectgc.cas.enabled (integer)
com.emc.ecs.sensor.btree_sr_pending_mininum (integer)
com.emc.ecs.sensor.repo_sr_pending_mininum (integer)

Top bucket statistics

Measurement: mm_topn_bucket_by_obj_count_place
Tags: last, place, process, tag
Fields: bucketName (string)
namespace (string)
value (integer)

Measurement: mm_topn_bucket_by_obj_size_place
Tags: last, place, process, tag
Fields: bucketName (string)
namespace (string)
value (integer)

Vnest membership and performance statistics

Measurement: vnestStat_membership_ismember
Tags: host, ismember, last, node_id, process
Fields: is_leader (string)

Measurement: vnestStat_performance_latency_type
Tags: host, id, last, node_id, process, type
Fields: +Inf (integer)
0.0 (integer)
1.0 (integer)
7999999.99999999 (integer)
825912.9477680004 (integer)
85266.52466135359 (integer)
8802.840841123942 (integer)
9.686250859269972 (integer)
908.7975284781536 (integer)
93.82345570870827 (integer)

Measurement: vnestStat_performance_transactions_from_type
Tags: from, host, last, node_id, process, type
Fields: failed_request_counter (integer)
succeed_request_counter (integer)

Database monitoring_op
Node system level statistics

Information:
Measurements listed in this section are from default Telegraf plugins. Here, the
measurement name equals the plugin name. Refer to the plugin documentation for more
information.

For example, see the documentation for the Telegraf plugin "cpu".

Measurement: cpu
Tags: cpu, host, node_id, tag
Fields: usage_guest (float)
usage_guest_nice (float)
usage_idle (float)
usage_iowait (float)
usage_irq (float)
usage_nice (float)
usage_softirq (float)
usage_steal (float)
usage_system (float)
usage_user (float)

Measurement: disk
Tags: device, fstype, host, mode, node_id, path, tag
Fields: free (integer)
inodes_free (integer)
inodes_total (integer)
inodes_used (integer)
total (integer)
used (integer)
used_percent (float)

Measurement: diskio
Tags: ID_PART_ENTRY_UUID, SCSI_IDENT_SERIAL, SCSI_MODEL, SCSI_REVISION,
SCSI_VENDOR, host, name, node_id, tag
Fields: io_time (integer)
iops_in_progress (integer)
read_bytes (integer)
read_time (integer)
reads (integer)
weighted_io_time (integer)
write_bytes (integer)
write_time (integer)
writes (integer)

Measurement: linux_sysctl_fs
Tags: host, node_id, tag
Fields: aio-max-nr (integer)
aio-nr (integer)
dentry-age-limit (integer)
dentry-nr (integer)
dentry-unused-nr (integer)
dentry-want-pages (integer)
file-max (integer)
file-nr (integer)
inode-free-nr (integer)
inode-nr (integer)
inode-preshrink-nr (integer)

Measurement: mem
Tags: host, node_id, tag
Fields: active (integer)
available (integer)
available_percent (float)
buffered (integer)
cached (integer)
commit_limit (integer)
committed_as (integer)
dirty (integer)
free (integer)
high_free (integer)
high_total (integer)
huge_page_size (integer)
huge_pages_free (integer)
huge_pages_total (integer)
inactive (integer)
low_free (integer)
low_total (integer)
mapped (integer)
page_tables (integer)
shared (integer)
slab (integer)
swap_cached (integer)
swap_free (integer)
swap_total (integer)
total (integer)
used (integer)
used_percent (float)
vmalloc_chunk (integer)
vmalloc_total (integer)
vmalloc_used (integer)
wired (integer)
write_back (integer)
write_back_tmp (integer)

Measurement: net
Tags: host, interface, node_id, tag
Fields: bytes_recv (integer)
bytes_sent (integer)
bytes_sum (integer)
drop_in (integer)
drop_out (integer)
err_in (integer)
err_out (integer)
packets_recv (integer)
packets_sent (integer)
packets_sum (integer)
speed (integer)
utilization (integer)

Measurement: nstat
Tags: host, name, node_id, tag
Fields: IpExtInOctets (integer)
IpExtOutOctets (integer)
TcpInErrs (integer)
UdpInErrors (integer)

Measurement: processes
Tags: host, node_id, tag
Fields: blocked (integer)
dead (integer)
idle (integer)
paging (integer)
running (integer)
sleeping (integer)
stopped (integer)
total (integer)
total_threads (integer)
unknown (integer)
zombies (integer)

Measurement: procstat
Tags: host, node_id, process_name, tag, user
Fields: cpu_time (integer)
cpu_time_guest (float)
cpu_time_guest_nice (float)
cpu_time_idle (float)
cpu_time_iowait (float)
cpu_time_irq (float)
cpu_time_nice (float)
cpu_time_soft_irq (float)
cpu_time_steal (float)
cpu_time_stolen (float)
cpu_time_system (float)
cpu_time_user (float)
cpu_usage (float)
create_time (integer)
involuntary_context_switches (integer)
memory_data (integer)
memory_locked (integer)
memory_rss (integer)
memory_stack (integer)
memory_swap (integer)
memory_vms (integer)
nice_priority (integer)
num_fds (integer)
num_threads (integer)
pid (integer)
read_bytes (integer)
read_count (integer)
realtime_priority (integer)
rlimit_cpu_time_hard (integer)
rlimit_cpu_time_soft (integer)
rlimit_file_locks_hard (integer)
rlimit_file_locks_soft (integer)
rlimit_memory_data_hard (integer)
rlimit_memory_data_soft (integer)
rlimit_memory_locked_hard (integer)
rlimit_memory_locked_soft (integer)
rlimit_memory_rss_hard (integer)
rlimit_memory_rss_soft (integer)
rlimit_memory_stack_hard (integer)
rlimit_memory_stack_soft (integer)
rlimit_memory_vms_hard (integer)
rlimit_memory_vms_soft (integer)
rlimit_nice_priority_hard (integer)
rlimit_nice_priority_soft (integer)
rlimit_num_fds_hard (integer)
rlimit_num_fds_soft (integer)
rlimit_realtime_priority_hard (integer)
rlimit_realtime_priority_soft (integer)
rlimit_signals_pending_hard (integer)
rlimit_signals_pending_soft (integer)
signals_pending (integer)
voluntary_context_switches (integer)
write_bytes (integer)
write_count (integer)

Measurement: swap
Tags: host, node_id, tag
Fields: free (integer)
in (integer)
out (integer)
total (integer)
used (integer)
used_percent (float)

Measurement: system
Tags: host, node_id, tag
Fields: load1 (float)
load15 (float)
load5 (float)
n_cpus (integer)
n_users (integer)
uptime (integer)
uptime_format (string)

DT statistics

Measurement: dtquery_dt_dist_dt_node_id_type
Tags: dt_node_id, process, tag, type
Fields: count_i (integer)

Measurement: dtquery_dt_dist_host_dt_node_id
Tags: dt_node_id, process, tag
Fields: count_i (integer)

Measurement: dtquery_dt_dist_type_type
Tags: process, tag, type
Fields: count_i (integer)

Measurement: dtquery_dt_status
Tags: process, tag
Fields: total (integer)
unknown (integer)
unready (integer)

Measurement: dtquery_dt_status_detailed_type
Tags: process, tag, type
Fields: total (integer)
unknown (integer)
unready (integer)

Fabric agent statistics

Measurement: ecs_fabric_agent_dirstat_size_bytes
Tags: host, node_id, path, tag, url
Fields: gauge (float)

SR journal statistics

Measurement: sr_JournalParser_GC_RG_DT
Tags: DT, RG, last, process
Fields: majorMinorOfJournalRegion (string)
pendingChunks (integer)
timestampOfChunkRegion (string)
timestampOfJournalParserLastRun (string)

Measurement: sr_ObjectGC_CAS_RG
Tags: RG, last, process
Fields: STATUS (string)

Vnest Btree statistics

Measurement: vnestStat_btree
Tags: cumulative_stats, host, level, node_id, tag
Fields: level_count (float)
page_count (float)
size_bytes (float)

Database monitoring_vdc
Metrics in this database are calculated values over the whole VDC without reference to a
particular data node.

Information:

The metrics below are aggregated over data nodes for the raw measurements used in the
Grafana ECS UI.

Measurement: cq_disk_bandwidth
Tags: type_op ('read', 'write')
Fields: consistency_checker (float)
erasure_encoding (float)
geo (float)
hardware_recovery (float)
total (float)
user_traffic (float)
xor (float)

Measurement: cq_node_rebalancing_summary
Tags: none
Fields: data_rebalanced (integer)
pending_rebalance (integer)

Measurement: cq_process_health
Tags: none
Fields: cpu_used (float)
mem_used (float)
mem_used_percent (float)
nic_bytes (float)
nic_utilization (float)

Measurement: cq_recover_status_summary
Tags: none
Fields: data_recovered (integer)
data_to_recover (integer)

Monitoring list of metrics: Performance


Information about generic tag values
The following tags have common values across all measurements:
• process - internal, set to statDataHead
• head - type of protocol, for example, S3
• namespace - name of namespace
• method - protocol-specific request method, for example, GET, POST, READ, WRITE
Database monitoring_main
Performance metrics in this database are raw. Each is split by data node, that is, all have host
and node_id tags.
Most integer fields are increasing counters, that is, values that increase over time. Increasing
counters restart from zero after a datahead service restart.
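Because increasing counters restart from zero after a service restart, a plain difference between consecutive samples goes negative at the restart. The following sketch computes per-interval increases and tolerates a restart. The function name counter_increases is hypothetical. Treating the first value after a restart as the increase since the restart is an assumption, and any requests counted between the last sample and the restart are not recoverable.

```python
def counter_increases(samples):
    """Return per-interval increases for a counter that may restart from zero."""
    increases = []
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            increases.append(current - previous)
        else:
            # Counter restarted: assume it counted up from zero to `current`.
            increases.append(current)
    return increases


# The drop from 170 to 5 marks a restart between the third and fourth samples.
print(counter_increases([100, 130, 170, 5, 25]))  # [30, 40, 5, 20]
```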

Measurement: statDataHead_performance_internal_error
Tags: host, node_id, process, tag
Fields: system_errors (integer)
user_errors (integer)

Measurement: statDataHead_performance_internal_error_code
Tags: code, host, node_id, process, tag
Fields: error_counter (integer)

Measurement: statDataHead_performance_internal_error_head
Tags: head, host, node_id, process, tag
Fields: system_errors (integer)
user_errors (integer)

Measurement: statDataHead_performance_internal_error_head_namespace
Tags: head, host, namespace, node_id, process, tag
Fields: system_errors (integer)
user_errors (integer)

Measurement: statDataHead_performance_internal_latency
Tags: host, id, node_id, process, tag
Fields: +Inf (integer)
0.0 (integer)
1.0 (integer)
111.6295328521717 (integer)
12461.15260479408 (integer)
23.183877401213103 (integer)
2588.0054039994393 (integer)
4.814963904455889 (integer)
537.4921713544796 (integer)
59999.999999999985 (integer)

Measurement: statDataHead_performance_internal_latency_head
Tags: head, host, id, node_id, process, tag
Fields: +Inf (integer)
0.0 (integer)
1.0 (integer)
111.6295328521717 (integer)
12461.15260479408 (integer)
23.183877401213103 (integer)
2588.0054039994393 (integer)
4.814963904455889 (integer)
537.4921713544796 (integer)
59999.999999999985 (integer)

Measurement: statDataHead_performance_internal_throughput
Tags: host, node_id, process, tag
Fields: total_read_requests_size (integer)
total_write_requests_size (integer)

Measurement: statDataHead_performance_internal_throughput_head
Tags: head, host, node_id, process, tag
Fields: total_read_requests_size (integer)
total_write_requests_size (integer)

Measurement: statDataHead_performance_internal_transactions
Tags: host, node_id, process, tag
Fields: failed_request_counter (integer)
succeed_request_counter (integer)

Measurement: statDataHead_performance_internal_transactions_head
Tags: head, host, node_id, process, tag
Fields: failed_request_counter (integer)
succeed_request_counter (integer)

Measurement: statDataHead_performance_internal_transactions_head_namespace
Tags: head, host, namespace, node_id, process, tag
Fields: failed_request_counter (integer)
succeed_request_counter (integer)

Measurement: statDataHead_performance_internal_transactions_method
Tags: host, method, node_id, process, tag
Fields: failed_request_counter (integer)
succeed_request_counter (integer)

Database monitoring_vdc
Performance metrics in this database are calculated values over the whole VDC without reference
to a particular data node.
Most values are one of the following:
• Rates (number of requests per second) - for all measurements not ending in "_delta"
• Delta values, the increase of a counter from the previous time stamp - for all measurements
ending in "_delta"
• Downsampled values (aggregated to one point per day) - for all measurements ending in
"_downsampled"

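The suffix rules above can be sketched as a helper that classifies a measurement by name. The function name describe_measurement is hypothetical. Treating a name that ends in "_delta_downsampled" as a downsampled delta is an interpretation of the naming pattern in the listing, not a documented rule.

```python
DOWNSAMPLED_SUFFIX = "_downsampled"
DELTA_SUFFIX = "_delta"


def describe_measurement(name):
    """Return (kind, downsampled) for a monitoring_vdc performance measurement.

    kind is "delta" or "rate"; downsampled is True for one-point-per-day series.
    Assumption: "_downsampled" is stripped before checking for "_delta".
    """
    downsampled = name.endswith(DOWNSAMPLED_SUFFIX)
    base = name[: -len(DOWNSAMPLED_SUFFIX)] if downsampled else name
    kind = "delta" if base.endswith(DELTA_SUFFIX) else "rate"
    return kind, downsampled


for name in (
    "cq_performance_error",
    "cq_performance_error_delta",
    "cq_performance_error_delta_downsampled",
):
    print(name, describe_measurement(name))
```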
Measurement: cq_performance_error
Tags: none
Fields: system_errors (float)
user_errors (float)

Measurement: cq_performance_error_downsampled
Tags: none
Fields: system_errors (float)
user_errors (float)

Measurement: cq_performance_error_code
Tags: code
Fields: error_counter (float)

Measurement: cq_performance_error_code_downsampled
Tags: code
Fields: error_counter (float)

Measurement: cq_performance_error_delta
Tags: none
Fields: system_errors_i (integer)
user_errors_i (integer)

Measurement: cq_performance_error_delta_downsampled
Tags: none
Fields: system_errors_i (integer)
user_errors_i (integer)

Measurement: cq_performance_error_head
Tags: head
Fields: system_errors (float)
user_errors (float)

Measurement: cq_performance_error_head_downsampled
Tags: head
Fields: system_errors (float)
user_errors (float)

Measurement: cq_performance_error_head_delta
Tags: head
Fields: system_errors_i (integer)
user_errors_i (integer)

Measurement: cq_performance_error_head_delta_downsampled
Tags: head
Fields: system_errors_i (integer)
user_errors_i (integer)

Measurement: cq_performance_error_ns
Tags: namespace
Fields: system_errors (float)
user_errors (float)

Measurement: cq_performance_error_ns_downsampled
Tags: namespace
Fields: system_errors (float)
user_errors (float)

Measurement: cq_performance_error_ns_delta
Tags: namespace
Fields: system_errors_i (integer)
user_errors_i (integer)

Measurement: cq_performance_error_ns_delta_downsampled
Tags: namespace
Fields: system_errors_i (integer)
user_errors_i (integer)
Measurement: cq_performance_latency
Tags: id
Fields: p50 (float)
p99 (float)
Measurement: cq_performance_latency_downsampled
Tags: id
Fields: p50 (float)
p99 (float)
Measurement: cq_performance_latency_head
Tags: head, id
Fields: p50 (float)
p99 (float)

Measurement: cq_performance_latency_head_downsampled
Tags: head, id
Fields: p50 (float)
p99 (float)
Measurement: cq_performance_throughput
Tags: none
Fields: total_read_requests_size (float)
total_write_requests_size (float)

Measurement: cq_performance_throughput_downsampled
Tags: none
Fields: total_read_requests_size (float)
total_write_requests_size (float)
Measurement: cq_performance_throughput_head
Tags: head
Fields: total_read_requests_size (float)
total_write_requests_size (float)

Measurement: cq_performance_throughput_head_downsampled
Tags: head
Fields: total_read_requests_size (float)
total_write_requests_size (float)
Measurement: cq_performance_transaction
Tags: none
Fields: failed_request_counter (float)
succeed_request_counter (float)

Measurement: cq_performance_transaction_downsampled
Tags: none
Fields: failed_request_counter (float)
succeed_request_counter (float)
Measurement: cq_performance_transaction_delta
Tags: none
Fields: failed_request_counter_i (integer)
succeed_request_counter_i (integer)

Measurement: cq_performance_transaction_delta_downsampled
Tags: none
Fields: failed_request_counter_i (integer)
succeed_request_counter_i (integer)
Measurement: cq_performance_transaction_head
Tags: head
Fields: failed_request_counter (float)
succeed_request_counter (float)

Measurement: cq_performance_transaction_head_downsampled
Tags: head
Fields: failed_request_counter (float)
succeed_request_counter (float)
Measurement: cq_performance_transaction_head_delta
Tags: head
Fields: failed_request_counter_i (integer)
succeed_request_counter_i (integer)

Measurement: cq_performance_transaction_head_delta_downsampled
Tags: head
Fields: failed_request_counter_i (integer)
succeed_request_counter_i (integer)
Measurement: cq_performance_transaction_method
Tags: method
Fields: failed_request_counter (float)
succeed_request_counter (float)

Measurement: cq_performance_transaction_method_downsampled
Tags: method
Fields: failed_request_counter (float)
succeed_request_counter (float)
Measurement: cq_performance_transaction_ns
Tags: namespace
Fields: failed_request_counter (float)
succeed_request_counter (float)

Measurement: cq_performance_transaction_ns_downsampled
Tags: namespace
Fields: failed_request_counter (float)
succeed_request_counter (float)
Measurement: cq_performance_transaction_ns_delta
Tags: namespace
Fields: failed_request_counter_i (integer)
succeed_request_counter_i (integer)

Measurement: cq_performance_transaction_ns_delta_downsampled
Tags: namespace
Fields: failed_request_counter_i (integer)
succeed_request_counter_i (integer)
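As an illustration of how the measurements, tags, and fields listed above fit together, the following sketch assembles a Flux query string for one of them. It is an assumption-laden example rather than an ECS API: the helper build_flux_query and the namespace value "ns1" are hypothetical, the bucket name is assumed to match the database name (as in the monitoring_op examples in this guide), and tag values are inserted without escaping. How the query is submitted is outside the scope of the sketch.

```python
from typing import Dict, Optional


def build_flux_query(
    bucket: str,
    measurement: str,
    field: str,
    tags: Optional[Dict[str, str]] = None,
    start: str = "-24h",
) -> str:
    """Build a Flux query selecting one field of one measurement.

    Assumption: the bucket name equals the database name (for example,
    "monitoring_vdc"). Values are not escaped, so pass trusted input only.
    """
    conditions = [
        f'r._measurement == "{measurement}"',
        f'r._field == "{field}"',
    ]
    # Each tag becomes an additional equality condition in the filter.
    for key, value in (tags or {}).items():
        conditions.append(f'r.{key} == "{value}"')
    predicate = " and ".join(conditions)
    return "\n".join(
        [
            f'from(bucket: "{bucket}")',
            f"  |> range(start: {start})",
            f"  |> filter(fn: (r) => {predicate})",
            '  |> keep(columns: ["_time", "_value"])',
        ]
    )


# "ns1" is a hypothetical namespace name used only for illustration.
query = build_flux_query(
    "monitoring_vdc",
    "cq_performance_transaction_ns",
    "failed_request_counter",
    tags={"namespace": "ns1"},
)
print(query)
```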

Flux API replacements for deprecated dashboard API


Processes statistics
Dashboard API

GET /dashboard/nodes/{id}/processes

GET /dashboard/processes/{id}

Flux API
Database:
l monitoring_op
Measurement:
l procstat (for details on the available fields and tags, see
https://github.com/influxdata/telegraf/tree/master/plugins/inputs/procstat)
Fields:
l memory_rss - resident memory of a process (bytes)
l cpu_usage - CPU usage of a process (percentage of a single CPU)
l num_threads - number of threads used by the process (integer)
Tags:
l process_name- valid process names:
n blobsvc
n cm
n coordinatorsvc
n dataheadsvc
n dtquery
n ecsportalsvc
n eventsvc
n georeceiver
n metering
n objcontrolsvc
n resourcesvc
n transformsvc
n vnest
n fluxd
n influxd
n throttler
n grafana-server
n dockerd
n fabric-agent
n fabric-lifecycle
n fabric-registry
n fabric-zookeeper
l host - hostname (FQDN)
l node_id - host ID
Note:
To replace /dashboard/processes/{id}, set r.process_name and r.node_id according to the "{id}" value.
For example, the id "330e4b8f-4491-4ec7-b816-7b10ac9c6abf-cm" corresponds to:

r.node_id == "330e4b8f-4491-4ec7-b816-7b10ac9c6abf"
r.process_name == "cm"
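
The mapping in this note can be sketched in code. The example below is illustrative only: the helper name split_process_id is hypothetical, and it assumes that the node ID is always a 36-character UUID, as in the example above. Splitting at a fixed offset, rather than at the last hyphen, keeps process names that themselves contain hyphens (such as fabric-agent) intact; whether Dashboard API IDs for such processes follow the same pattern is an assumption.

```python
import uuid
from typing import Tuple

UUID_LENGTH = 36  # assumed length of the node ID part, e.g. 330e4b8f-...-7b10ac9c6abf


def split_process_id(process_id: str) -> Tuple[str, str]:
    """Split a Dashboard API process id into (node_id, process_name)."""
    node_id = process_id[:UUID_LENGTH]
    separator = process_id[UUID_LENGTH:UUID_LENGTH + 1]
    process_name = process_id[UUID_LENGTH + 1:]
    if separator != "-" or not process_name:
        raise ValueError(f"unexpected process id format: {process_id!r}")
    uuid.UUID(node_id)  # raises ValueError if the prefix is not a valid UUID
    return node_id, process_name


node_id, process_name = split_process_id("330e4b8f-4491-4ec7-b816-7b10ac9c6abf-cm")
print(f'r.node_id == "{node_id}"')
print(f'r.process_name == "{process_name}"')
```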

Example query:

from(bucket: "monitoring_op")
  |> range(start: -24h)
  |> filter(fn: (r) => r._measurement == "procstat" and r._field == "memory_rss" and
      r.process_name == "vnest" and r.host == "host_name")
  |> keep(columns: ["_time", "_value", "process_name"])

Example output:

#datatype,string,long,dateTime:RFC3339,long,string
#group,false,false,false,false,true
#default,_result,,,,
,result,table,_time,_value,process_name
,,0,2019-08-15T13:05:00Z,2505809920,vnest
,,0,2019-08-15T13:10:00Z,2505887744,vnest
,,0,2019-08-15T13:15:00Z,2506014720,vnest
,,0,2019-08-15T13:20:01Z,2506010624,vnest
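
Output in this form can be read with a standard CSV parser once the annotation rows are skipped. The following is a minimal sketch under stated assumptions: the helper parse_annotated_csv is hypothetical, it handles a single result table only, and it treats every line beginning with "#" as an annotation. Values are returned as strings.

```python
import csv
import io
from typing import Dict, List

SAMPLE_OUTPUT = """#datatype,string,long,dateTime:RFC3339,long,string
#group,false,false,false,false,true
#default,_result,,,,
,result,table,_time,_value,process_name
,,0,2019-08-15T13:05:00Z,2505809920,vnest
,,0,2019-08-15T13:10:00Z,2505887744,vnest
"""


def parse_annotated_csv(text: str) -> List[Dict[str, str]]:
    """Parse single-table annotated CSV into a list of row dictionaries."""
    # Drop blank lines and annotation rows (#datatype, #group, #default).
    data_lines = [
        line for line in text.splitlines() if line and not line.startswith("#")
    ]
    reader = csv.DictReader(io.StringIO("\n".join(data_lines)))
    rows = []
    for row in reader:
        row.pop("", None)  # the leading column is reserved for annotations
        rows.append(row)
    return rows


rows = parse_annotated_csv(SAMPLE_OUTPUT)
for row in rows:
    print(row["_time"], int(row["_value"]), row["process_name"])
```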

Nodes statistics
Dashboard API

GET /dashboard/nodes/{id}

Flux API
Database:
l monitoring_op
Measurement:
l cpu (for details on the available fields and tags, see
https://github.com/influxdata/telegraf/tree/master/plugins/inputs/cpu)
Fields:
l usage_idle - percentage of time that the CPU is idle
Tags:
l host - hostname (FQDN)
l node_id - host ID
Example query:

from(bucket: "monitoring_op")
  |> range(start: -24h)
  |> filter(fn: (r) => r._measurement == "cpu" and r.cpu == "cpu-total" and
      r._field == "usage_idle" and r.host == "host_name")
  |> keep(columns: ["_time", "_value", "host"])

Example output:

#datatype,string,long,dateTime:RFC3339,double,string
#group,false,false,false,false,true
#default,_result,,,,
,result,table,_time,_value,host
,,0,2019-08-15T13:20:00Z,19.549454477395525,host_name
,,0,2019-08-15T13:25:00Z,17.920104933062728,host_name
,,0,2019-08-15T13:30:00Z,18.050788903551002,host_name
,,0,2019-08-15T13:35:00Z,19.801364027505095,host_name
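
Because usage_idle reports idle time, a utilization figure has to be derived from it. The sketch below assumes that CPU utilization can be approximated as 100 minus the idle percentage; the helper name cpu_busy_percent is hypothetical, and the sample values are taken from the example output above.

```python
def cpu_busy_percent(usage_idle: float) -> float:
    """Approximate CPU utilization as the complement of the idle percentage."""
    if not 0.0 <= usage_idle <= 100.0:
        raise ValueError(f"usage_idle out of range: {usage_idle}")
    return 100.0 - usage_idle


# usage_idle values from the example output above.
idle_samples = [
    19.549454477395525,
    17.920104933062728,
    18.050788903551002,
    19.801364027505095,
]
busy_samples = [cpu_busy_percent(value) for value in idle_samples]
average_busy = sum(busy_samples) / len(busy_samples)
print(f"average CPU busy: {average_busy:.2f}%")
```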

Measurement:
l mem (for details on the available fields and tags, see
https://github.com/influxdata/telegraf/tree/master/plugins/inputs/mem)
Fields:
l free - free memory on the host (bytes)
Tags:
l host - hostname (FQDN)
l node_id - host ID
Example query:

from(bucket: "monitoring_op")
  |> range(start: -24h)
  |> filter(fn: (r) => r._measurement == "mem" and r._field == "free" and
      r.host == "host_name")
  |> keep(columns: ["_time", "_value", "host"])

Example output:

#datatype,string,long,dateTime:RFC3339,long,string
#group,false,false,false,false,true
#default,_result,,,,
,result,table,_time,_value,host
,,0,2019-08-15T14:10:00Z,3181088768,host_name
,,0,2019-08-15T14:15:00Z,2988388352,host_name
,,0,2019-08-15T14:20:00Z,3002994688,host_name
,,0,2019-08-15T14:25:00Z,3115741184,host_name
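
The free field is reported in bytes, which is awkward to read at this scale. As a small illustrative sketch (the helper name bytes_to_gib is hypothetical, and the sample values are copied from the example output above), the values can be converted to GiB before display:

```python
def bytes_to_gib(num_bytes: int) -> float:
    """Convert a byte count to gibibytes (1 GiB = 1024**3 bytes)."""
    return num_bytes / (1024 ** 3)


# Free-memory values (bytes) from the example output above.
free_samples = [3181088768, 2988388352, 3002994688, 3115741184]
lowest_free = min(free_samples)
print(f"lowest free memory in the window: {bytes_to_gib(lowest_free):.2f} GiB")
```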

Performance statistics
Dashboard API

GET /dashboard/nodes/{id}

GET /dashboard/zones/localzone

GET /dashboard/zones/localzone/nodes

Dashboard APIs
Lists the APIs that are deprecated.
APIs removed in ECS 3.5.0
The following table lists the APIs that are removed in ECS 3.5.0:

Table 27 APIs removed in ECS 3.5.0

API Name             Syntax                                 Description

Get Process          GET /dashboard/processes/{id}          Gets the process instance details.

Get Node Processes   GET /dashboard/nodes/{id}/processes    Gets the details of processes in the node.



CHAPTER 5
Examining Service Logs

l ECS service logs


ECS service logs


Describes the location and content of ECS service logs.
You can access ECS service logs directly from an SSH session on a node by changing to the following
directory: /opt/emc/caspian/fabric/agent/services/object/main/log. You can also
access the logs from the Service Console.
Note:
The emcservice user cannot access service logs. When the node is locked using the platform
lockdown feature, users cannot access service logs. Only an administrator who has permission
to access the node can access the logs.
The following logs are provided:
l authsvc.log: Records information from the authentication service.
l blobsvc*.log: Records aspects of the binary large object (BLOB) service.
l cassvc*.log: Records aspects of the CAS service.
l coordinatorsvc.log: Records information from the coordinator service.
l ecsportalsvc.log: Records information from the ECS Portal service.
l eventsvc*.log: Records aspects of the event service. This information is available in the
ECS Portal at Monitor > Events.
l hdfssvc*.log: Records aspects of the HDFS service.
l objcontrolsvc.log: Records information from the object service.
l objheadsvc*.log: Records aspects of the various object heads supported by the object
service.
l provisionsvc*.log: Records aspects of the ECS provisioning service.
l resourcesvc*.log: Records information that is related to global resources such as namespaces,
buckets, and object users.
l dataheadsvc-access.log: Records aspects of the object heads supported by the
object service, the file service supported by HDFS, and the CAS service.
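
Several of the names above contain a wildcard, so one service can have more than one log file. The following sketch shows one way to list the files for a service on a node. It is not an ECS tool: the helper name find_service_logs is hypothetical, the script is assumed to run on the node with sufficient permission to read the log directory, and the pattern "<service>*.log" also matches related files such as dataheadsvc-access.log.

```python
from pathlib import Path
from typing import List, Union

# Log directory given in this section of the guide.
LOG_DIR = Path("/opt/emc/caspian/fabric/agent/services/object/main/log")


def find_service_logs(
    service: str, log_dir: Union[str, Path] = LOG_DIR
) -> List[Path]:
    """Return the sorted log files whose names match "<service>*.log"."""
    directory = Path(log_dir)
    if not directory.is_dir():
        return []  # directory missing or not accessible from this host
    return sorted(directory.glob(f"{service}*.log"))


for log_path in find_service_logs("blobsvc"):
    print(log_path)
```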
