ECS 3.5 Monitoring Guide
Version 3.5
Monitoring Guide
Rev01
May 2020
Copyright © 2019-2020 Dell Inc. or its subsidiaries. All rights reserved.
Dell believes the information in this publication is accurate as of its publication date. The information is subject to change without notice.
THE INFORMATION IN THIS PUBLICATION IS PROVIDED “AS-IS.” DELL MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND
WITH RESPECT TO THE INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED WARRANTIES OF
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. USE, COPYING, AND DISTRIBUTION OF ANY DELL SOFTWARE DESCRIBED
IN THIS PUBLICATION REQUIRES AN APPLICABLE SOFTWARE LICENSE.
Dell Technologies, Dell, EMC, Dell EMC and other trademarks are trademarks of Dell Inc. or its subsidiaries. Other trademarks may be the property
of their respective owners. Published in the USA.
Dell EMC
Hopkinton, Massachusetts 01748-9103
1-508-435-1000 In North America 1-866-464-7381
www.DellEMC.com
View requests
The Requests panel displays the total requests, successful requests, and failed requests.
Failed requests are organized by system error and user error. User failures are typically HTTP 400
errors; system failures are typically HTTP 500 errors. Click Requests to see more request
metrics.
Request statistics do not include replication traffic.
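The user-error/system-error split above can be reproduced when post-processing access logs. The sketch below is illustrative only (the status codes and counters are hypothetical sample data, not an ECS API):

```python
from collections import Counter

def classify_request(status: int) -> str:
    """Bucket an HTTP status code the way the Requests panel does:
    400-series responses count as user errors, 500-series as system errors."""
    if 200 <= status < 400:
        return "success"
    if 400 <= status < 500:
        return "user_error"
    if 500 <= status < 600:
        return "system_error"
    return "unknown"

# Hypothetical sample of response codes pulled from an access log:
statuses = [200, 201, 404, 403, 500, 503, 206]
totals = Counter(classify_request(s) for s in statuses)
print(totals["success"], totals["user_error"], totals["system_error"])  # 3 2 2
```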
Capacity amounts are shown in gibibytes (GiB) and tebibytes (TiB). One GiB is approximately equal
to 1.074 gigabytes (GB). One TiB is approximately equal to 1.1 terabytes (TB).
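The binary-to-decimal relationship quoted above can be checked with a couple of lines of arithmetic (no ECS API involved):

```python
GIB = 2**30   # 1 gibibyte in bytes
TIB = 2**40   # 1 tebibyte in bytes
GB = 10**9    # 1 gigabyte in bytes
TB = 10**12   # 1 terabyte in bytes

print(f"1 GiB = {GIB / GB:.3f} GB")  # 1 GiB = 1.074 GB
print(f"1 TiB = {TIB / TB:.3f} TB")  # 1 TiB = 1.100 TB
```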
The Used capacity indicates the amount of capacity that is in use. Click Capacity Utilization to
see more capacity metrics.
The capacity metrics are available in the left menu.
View performance
The Performance panel displays how network read and write operations are currently performing,
and the average read/write performance statistics over the last 24 hours for the VDC.
Click Performance to see more comprehensive performance metrics.
Note: An SSD Cache Enabled label appears when the feature is enabled on the node. If Read
Cache is disabled, or the nodes do not have SSD disks, no SSD Cache Enabled label
appears.
View alerts
The Alerts panel displays a count of critical alerts and errors.
Click Alerts to see the full list of current alerts. Any Critical or Error alerts are linked to the Alerts
tab on the Events page where only the alerts with a severity of Critical or Error are filtered and
displayed.
Note: Alerts can also be filtered by the Info and Warning severities.
View audits
Audits can be filtered only by date-time range and namespace.
Table navigation
Highlighted text in a table row indicates a link to a detail display. Selecting the link drills down to
the next level of detail. On drill-down displays, a path string shows your current location in the
sequence of drill-down displays. This path string is called a breadcrumb trail or breadcrumbs for
short. Selecting any highlighted breadcrumb jumps up to the associated display.
On some monitoring displays, you can force a table to refresh with the latest data by clicking the
Refresh icon.
Figure 2 Refresh icon
Figure 3 Open Filter panel with date and time range selections
When the table has the Current filter applied, the latest values are displayed. When the table has a
date-time range filter applied, it displays the average value over that period.
History
When you select a History button, all available charts for that row are displayed below the table.
You can hover over a chart from left to right to see a vertical line that helps you find a specific
date-time point on the chart. A pop-up display shows the value and timestamp for that point.
The date-time scale is determined by the filter setting that has been configured. When the
Current filter is selected, the charts show data from the last 24 hours. History data is kept for 60
days.
Figure 4 History chart with active cursor
In the history charts, when the Current filter is selected, if there is no available historical data, No
Data displays.
Export icon
The Export icon enables you to export data from all monitoring tables and graphs in PDF, DOC,
XLS, and CSV formats for later use. To select the format and export the data, use the Export
icon in the upper right of the menu bar on each table and graph.
The exported data can be used to get a longer term view on capacity usage and consumption
trends.
Figure 5 Export icons
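An exported CSV lends itself to exactly this kind of trend analysis. The sketch below parses a capacity export with the standard library; the column names ("Date", "Used (GB)") and values are assumptions for illustration, not the exact ECS export schema:

```python
import csv
import io

# Hypothetical extract of a CSV capacity export (not the real ECS schema):
exported = """Date,Used (GB)
2020-05-01,4000
2020-05-08,4300
2020-05-15,4650
"""

rows = list(csv.DictReader(io.StringIO(exported)))
used = [float(r["Used (GB)"]) for r in rows]

# Average growth between consecutive weekly samples:
weekly_growth = (used[-1] - used[0]) / (len(used) - 1)
print(f"average weekly growth: {weekly_growth:.0f} GB")  # average weekly growth: 325 GB
```

In practice you would read the downloaded file with `csv.DictReader(open(path))` instead of the embedded string.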
If you select Custom, use the From and To calendars to choose the time period for which
data will be displayed.
3. Select the namespace for which you want to display metering data. To narrow the list of
namespaces, type the first few letters of the target namespace and click the magnifying
glass icon.
If you are a Namespace Administrator, you can select only your own namespace.
4. Click the + icon next to each namespace you want to see object data for.
5. To see the data for a particular bucket, click the + icon next to each bucket for which you
want to see data.
To narrow the list of buckets, type the first few letters of the target bucket and click the
magnifying glass icon.
If you do not specify a bucket, the object metering data will be the totals for all buckets in
the namespace.
6. Click Apply to display the metering data for the selected namespace and bucket for the
specified time period.
Note: While all buckets in a geo-federation can be selected in metering, if a selected
bucket is not associated in a replication group to which the VDC that you are logged into
belongs, metering information cannot be retrieved for that bucket. In this case, after a
wait, the bucket is listed as No data. To get the metering information for the bucket,
log in to the VDC that owns the bucket or any VDC that is part of the replication group
to which the bucket belongs.
Depending on the Date Time Range selected, the attributes that are displayed in the
Metering Page may change. If the Current option is selected, only Namespace, Buckets,
Bucket Tags, Total MPU Parts, Total MPU Size, Total Size, Object Count, and Last
Updated attributes are displayed in the table. If Custom or any other time range is
chosen, the Namespace, Buckets, Bucket Tags, Total MPU Parts, Total MPU Size, Total
Size, Object Count, Objects Created, Objects Deleted, Write Traffic and Read Traffic
attributes are displayed in the table and the Last Updated attribute is not displayed.
Metering data
Object metering data for a specified namespace, or a specified bucket within a namespace, can be
obtained for a defined time period at the ECS Portal Monitor > Metering page.
The metering information that is provided is shown in the following table:
Attribute Description
Buckets Bucket selected for which the metering data applies. If blank, the data is for
all buckets in the namespace.
Bucket Tags Lists any name=value bucket tags associated with the bucket.
Total MPU Parts The number of MPU parts that have been created and not used as part of a
complete MPU operation.
Total MPU Size The total disk size occupied by MPU parts that have been created and not
used as part of a complete MPU operation.
Total Size Total size of the objects that are stored in the selected namespace or bucket
at the end time that is specified in the filter. If the size is less than 1 GB, then
the portal displays 0GB.
Object Count Number of objects that are associated with the selected namespace or
bucket at the end time that is specified in the filter.
Last Updated If the Current filter is selected, Last Updated displays the time until which
metering data can be considered consistent. This can help you determine
any delay in reported metering stats. The metering stats may include some
data on the operations that are performed after the last updated time.
Objects Created Number of objects that are created in the selected namespace or bucket in
the time period.
Objects Deleted Number of objects that are deleted from the selected namespace or bucket
in the time period.
Write Traffic Total of incoming object data (writes) for the selected namespace or bucket
during the specified period. Values are displayed in a size unit that is based
on the size of the data.
Read Traffic Total of outgoing object data (reads) for the selected namespace or bucket
during the specified period. Values are displayed in a size unit that is based
on the size of the data.
Note: When you perform an update operation on an object, the metering service shows the
object overwrite as both Objects Created and Objects Deleted. The Objects Deleted
is shown because of the expected OVERWRITE behavior of an object; however, no object is
actually deleted.
Note: Metering is not a real-time reporting activity but is performed as a background process,
so some delay in reporting can occur. The longest delay is about 15 minutes. However, when
the system is under heavy load or is unstable, longer delays can occur. If you encounter
longer delays, contact ECS Customer Support.
Note: When there are many concurrent requests, ECS metering can ignore some requests so
that they do not impact system performance. Hence, the Write Traffic value can show less
than the actual write bandwidth.
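The same metering data is also exposed through the ECS Management REST API's billing interface. The sketch below only builds the request URL and parses a sample response; the exact endpoint path, query parameters, and response field names are assumptions based on the /object/billing interface and should be verified against the ECS Management REST API reference for your release:

```python
import json

def billing_url(base: str, namespace: str, bucket: str,
                start: str, end: str) -> str:
    """Assemble a (hypothetical) bucket-billing request URL."""
    return (f"{base}/object/billing/buckets/{namespace}/{bucket}/info"
            f"?start_time={start}&end_time={end}")

url = billing_url("https://ecs.example.com:4443", "ns1", "bucket1",
                  "2020-05-01T00:00", "2020-05-02T00:00")
print(url)

# A hypothetical (not captured) response body, for parsing illustration only:
sample = json.loads('{"total_size": 1024, "total_objects": 10}')
print(sample["total_size"], sample["total_objects"])  # 1024 10
```

A real call would also need an authentication token header obtained from the management login endpoint; consult the API reference for the login flow.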
Read-only system
When the storage pool reaches 90% of its total capacity, it does not accept write requests and it
becomes a read-only system. A storage pool must have a minimum of four nodes and must have
three or more nodes with more than 10% free capacity in order to allow writes. This reserved
space is required to ensure that ECS does not run out of space while persisting system metadata.
If these criteria are not met, the write fails. The ability of a storage pool to accept writes does not
affect the ability of other pools to accept writes. For example, if you have a load balancer that
detects a failed write, the load balancer can redirect the write to another VDC.
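The write-acceptance rule above can be expressed as a small predicate. This is a sketch of the documented policy for illustration, not code that runs on ECS:

```python
def pool_accepts_writes(node_free_fractions):
    """A storage pool accepts writes only if it has at least four nodes
    and three or more of them have more than 10% free capacity.
    node_free_fractions: free capacity per node as a fraction (0.0-1.0)."""
    if len(node_free_fractions) < 4:
        return False
    nodes_with_headroom = sum(1 for f in node_free_fractions if f > 0.10)
    return nodes_with_headroom >= 3

print(pool_accepts_writes([0.25, 0.18, 0.12, 0.05]))  # True: 3 nodes above 10%
print(pool_accepts_writes([0.08, 0.09, 0.30, 0.40]))  # False: only 2 nodes above 10%
```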
Capacity forecast
You can use the Capacity tab to monitor when the capacity is expected to reach 50% and 80%.
Capacity forecast is based on the current usage pattern, shown as 1-day, 7-day, and 30-day
usage trends. Capacity Forecast data is shown for the entire VDC, for an individual
storage pool, or for individual nodes.
Note: The capacity ETA shown as N/A could be due to the following reasons:
1. There is not enough historical data for a forecast. At least two data points (1 hour apart) are
required. This can happen when the ECS system is newly deployed. Click the History button at
the VDC, storage pool, or node level to verify.
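The forecast mechanics can be approximated with a linear extrapolation over the trend window. This is an illustration of the idea only, not the exact ECS algorithm; note that it needs at least two data points, matching the N/A condition above:

```python
def hours_to_threshold(samples, capacity_total, threshold=0.5):
    """samples: list of (hour, used_capacity) points, at least two.
    Returns estimated hours until usage reaches threshold * capacity_total,
    0.0 if already there, or None if data is insufficient or usage is flat."""
    if len(samples) < 2:
        return None                       # mirrors the N/A case in the portal
    (t0, u0), (t1, u1) = samples[0], samples[-1]
    rate = (u1 - u0) / (t1 - t0)          # capacity units per hour
    target = threshold * capacity_total
    if u1 >= target:
        return 0.0
    if rate <= 0:
        return None                       # no growth: no meaningful ETA
    return (target - u1) / rate

# 100 TiB pool growing 1 TiB/hour, currently at 40 TiB used, 50% threshold:
print(hours_to_threshold([(0, 39), (1, 40)], 100, threshold=0.5))  # 10.0
```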
Monitor capacity
You can use the Capacity tab to view capacity utilization data for:
l VDC (VDC capacity utilization on page 19)
l Storage Pools (Storage pool capacity utilization on page 20)
l Nodes (Node capacity utilization on page 21)
l Disks (Disk capacity utilization on page 21)
l Used Capacity (Monitor used capacity on page 22)
You can view summary storage usage data about total, used, available, and reserved storage
capacity for storage pools and nodes.
Reserved capacity is the approximately 10 percent of the total capacity that is reserved for failure
handling and for performing erasure encoding or XOR operations. Reserved capacity is not
available for writing new data.
The tab opens with the Storage Pools capacity table displayed. To view capacity data for individual
nodes, click the appropriate link in the Nodes (Online) column to display the Nodes table. Click
the appropriate link in the Disks (Online) column to view capacity data for individual disks.
You can display average values over a selected date-time range or over a custom time range using
the Filter drop-down menu. The Current filter displays the latest available values and is the
default filter value.
When the table has the Date Time Range filter set to Current (the default setting), the table
displays the latest values and the history graphs display values over the last 24-hour period. When
the table has a Date Time Range filter applied (other than Current), it displays the average value
over that period.
VDC capacity utilization
Attribute Description
Per 1 Day Trend 50% Forecasts when the VDC capacity is expected to reach 50%, based on the
1-day usage trend.
Per 7 Day Trend 50% Forecasts when the VDC capacity is expected to reach 50%, based on the
7-day usage trend.
Per 30 Day Trend 50% Forecasts when the VDC capacity is expected to reach 50%, based on the
30-day usage trend.
Per 1 Day Trend 80% Forecasts when the VDC capacity is expected to reach 80%, based on the
1-day usage trend.
Per 7 Day Trend 80% Forecasts when the VDC capacity is expected to reach 80%, based on the
7-day usage trend.
Per 30 Day Trend 80% Forecasts when the VDC capacity is expected to reach 80%, based on the
30-day usage trend.
Total Total capacity of the VDC that is online. This is the total of the capacity that
is already used and the capacity still free for allocation.
Available (Reserved) Online capacity available for use, including the approximately 10% of the
total capacity that is reserved for failure handling and for performing erasure
encoding or XOR operations.
Note: If the Current filter is applied, Available (Reserved) displays. If a filter
other than Current is applied, only Available displays.
Actions History provides a graphic display of the data. If the Current filter (default)
is selected, the History button displays total, used, and available capacity for
the last 24 hours. History data is kept for 60 days.
Attribute Description
Nodes (Online) Number of nodes in the storage pool followed by the number of those nodes
online. Click this number to open: Node capacity utilization on page 21.
Online Nodes with Sufficient Disk Space Number of online nodes that have sufficient disk space to
accept new data. If too many disks are too full to accept new data, the
performance of the system may be impacted.
Note: Does not appear if a filter other than Current is applied.
Disks (Online) Number of disks in the storage pool followed by the number of those disks
that are online.
Total Total capacity of the storage pool that is online. This is the total of the
capacity that is already used and the capacity still free for allocation.
Available (Reserved) Online capacity available for use, including the approximately 10% of the
total capacity that is reserved for failure handling and for performing erasure
encoding or XOR operations.
Note: If the Current filter is applied, Available (Reserved) displays. If a filter
other than Current is applied, only Available displays.
Actions History provides a graphic display of the data. If the Current filter (default)
is selected, the History button displays total, used, and available capacity for
the last 24 hours. History data is kept for 60 days.
Attribute Description
Disks (Online) Number of disks that are associated with the node followed by the number
of those disks that are online. Click disk number to open: Disk capacity
utilization on page 21
Total Total online capacity provided by the online disks within the node. This is the
total of the capacity that is already used and the capacity still free for
allocation.
Available (Reserved) Remaining online capacity available in the node, including reserved capacity.
Note: If the Current filter is applied, Available (Reserved) displays. If a filter
other than Current is applied, only Available displays.
Online Status Indicates whether the node is online or offline. A check mark indicates that
the node status is Good.
Actions History provides a graphic display of the data. If the Current filter (default)
is selected, the History button displays total, used, and available capacity for
the last 24 hours. History data is kept for 60 days.
Attribute Description
Online Status Indicates whether the disk is online or offline. The check mark indicates that
the disk status is Good.
Actions History provides a graphic display of the data. If the Current filter (default)
is selected, the History button displays total, used, and available capacity for
the last 24 hours. History data is kept for 60 days.
User Data The capacity that is used for the repository chunks representing data uploaded
by ECS users.
System Metadata The capacity that is used by the ECS processes that track and describe the data
in the system.
Protection Overhead The combined overhead of triple mirroring and erasure coding for all user data,
system metadata, and geo data protection chunks protected locally.
Geo Cache The capacity used to cache chunks that are accessed locally but not stored
locally.
Geo Copy The capacity that is used for Geo-replication chunks stored on the current VDC.
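The Protection Overhead row above reflects the relative cost of the two protection schemes: ECS triple-mirrors newly written chunks and later erasure-codes them. The arithmetic below is an illustration; the 12 data + 4 coding fragment scheme is the commonly documented ECS default, but confirm the scheme for your release:

```python
def protection_overhead(user_bytes, data_frags=12, coding_frags=4):
    """Return (mirrored_total, erasure_coded_total) on-disk footprint in bytes
    for a given amount of user data."""
    mirrored = user_bytes * 3                                   # triple mirroring: 3.0x
    ec = user_bytes * (data_frags + coding_frags) / data_frags  # 16/12 = ~1.33x
    return mirrored, ec

mirrored, ec = protection_overhead(12 * 2**30)   # 12 GiB of user data
print(mirrored / 2**30, "GiB mirrored,", ec / 2**30, "GiB erasure-coded")
# 36.0 GiB mirrored, 16.0 GiB erasure-coded
```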
Storage usage is shown as color-coded bars, one color for the current VDC, and a different color
for its storage pools. Tool tips for each colored bar correspond to the status information in the
numeric status line.
Attribute Description
Storage Type The VDC or storage pool for which to view garbage collection data.
Total Garbage Detected The amount of reclaimable storage capacity detected on the system.
Capacity Reclaimed The amount of storage capacity reclaimed by the garbage collection
process.
Capacity Pending Reclamation The amount of storage capacity that is identified as reclaimable but not
reclaimed yet.
UnReclaimable Garbage The amount of storage capacity that cannot be reclaimed currently.
Capacity Reclaimed
Click the Filter button to set a filter for the reclamation data by VDC or storage pool over a date/
time range.
Attribute Description
Storage Type The VDC or storage pool for which to view capacity reclaimed data.
Capacity Reclaimed The amount of storage capacity recovered following garbage collection.
Actions History provides a graphic display of the data. If the Current filter
(default) is selected, the History button displays the total reclaimed
capacity for the last 24 hours. History data is kept for 60 days.
Column Description
Total Coding Data The total logical size of all data chunks in the storage pool which are subject
to erasure encoding.
Total Coded Data The total logical size of all erasure-encoded chunks in the storage pool.
Coded (%) The percent of data in the storage pool that is erasure encoded. Percent
values display with three decimal places in the history chart for accurate
plotting. Percent values display with two decimal places in the table,
consistent with the format of the other values in the table.
Coding Rate The rate at which any current data waiting for erasure encoding is being
processed.
Est. Time to Complete The estimated completion time extrapolated from the current erasure
encoding rate.
Actions l History provides a graphic display of the total coding data, total coded
data, percent of data coded, and coding rate per second. History data is
kept for 60 days.
l If the Current filter is selected, History displays default history for the
last 24 hours.
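The Est. Time to Complete value described above is a straightforward extrapolation: remaining un-coded data divided by the current coding rate. A sketch of the arithmetic (an illustration, not ECS source code):

```python
def eta_seconds(total_to_code, already_coded, rate_per_sec):
    """Estimate seconds until erasure encoding completes at the current rate.
    Returns 0.0 if nothing remains, None if the rate is zero (no progress)."""
    remaining = total_to_code - already_coded
    if remaining <= 0:
        return 0.0
    if rate_per_sec <= 0:
        return None
    return remaining / rate_per_sec

# 500 GiB pending, coding at 100 MiB/s:
print(eta_seconds(500 * 2**30, 0, 100 * 2**20))  # 5120.0 (seconds)
```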
Attribute Description
Unreferenced Blob Data Detected The amount of unreferenced blob data in the bucket (in bytes).
Reflection Data Detected The amount of reflection data in the bucket (in bytes).
Actions History provides a graphic display of the unreferenced blob and reflection
data. If the Current filter (default) is selected, the History button displays
the data for the last 24 hours. History data is kept for 60 days.
l Removed: The disk is one on which the system has completed recovery and has removed from
the storage engine's list of valid disks. A history of all removed disks is displayed in the ECS UI.
l Not Accessible: If a node is not accessible, then all its disks have this status. It indicates that
the actual status of the disk is not available to ECS.
Note: The Current filter displays the latest available values. A date-time range filter displays
average values over the specified range. Value data is kept for 60 days.
Procedure
1. Select Monitor > System Health and select the Hardware Health tab.
By default the Offline Nodes subtab displays. This table may be empty if all nodes are
online. Similarly, the Offline Data Disks subtab may be empty if all disks are online.
2. Select the Offline Nodes and Offline Data Disks subtabs to view a summary.
3. Select the All Nodes and Data Disks subtab to drill down to nodes and disks.
4. Click the node name to drill down to its disk health page.
Note: The Slot Info value always matches the physical slot ID in ECS U-Series, C-
Series, and D-Series Appliances. This makes Slot Info useful for quickly locating a disk
during disk replacement service. Some Certified Hardware installations with ECS
Software may not report useful or reliable data for Slot Info.
Note: Monitor the health of online and offline storage pool nodes and data disks. All data
disks that belong to the selected node are listed here. SSD Read Caches are not
included.
Avg. NIC Bandwidth VDC and Node Average bandwidth of the network interface
controller hardware that is used by the selected VDC
or node.
Avg. CPU Usage (%) VDC and Node Average percentage of the CPU hardware that is
used by the selected VDC or node.
Avg. Memory Usage VDC and Node Average usage of the aggregate memory available to
the VDC or node.
Relative NIC (%) VDC and Node Percentage of the available bandwidth of the network
interface controller hardware that is used by the
selected VDC or node.
Relative Memory (%) VDC and Node Percentage of the memory used relative to the
memory available to the selected VDC or node.
CPU Usage Process Percentage of the node's CPU used by the process.
The list of processes that are tracked is not the
complete list of processes running on the node. The
sum of the CPU used by the processes is not equal to
the CPU usage shown for the node.
Relative Memory (%) Process Percentage of the memory used relative to the
memory available to the process.
Last Restart Process The last time the process restarted on the node.
Process Description
Blob Service (blobsvc) Manages the following tables: Object (OB), Listing (LS), and Repo
Chunk Reference (RR).
Chunk Manager (cm) Manages the following tables: Chunk (CT), Btree Reference (BR).
Provides the logic to handle various events based on the chunk's
current state and decide which state to transition to next.
Directory Table Query (dtquery) Provides REST APIs to get Directory Table (DT) details.
GeoReceiver (georeceiver) Receives requests for chunks in the current VDC that are not owned
by the current VDC (secondary chunks). It then requests Chunk
Manager to start an operation to track the copy chunk creation and
select three replicas. The GeoReceiver process then writes the
datastream to the three instances. On successful completion, it directs
Chunk Manager to commit the copy chunk.
Head Service (headsvc) Manages object head protocols: S3, OpenStack Swift, EMC Atmos,
CAS, and HDFS.
Metering (metering) Manages the following tables: Metering Aggregate (MA) and Metering
Raw (MR).
Object Control Service (objcontrolsvc) Provides REST APIs for configuring the ECS cluster, managing ECS
resources, and monitoring the system.
Provision Service (provisionsvc) Manages the provisioning of storage resources and user access. It
handles user management, authorization, and authentication.
Resource Service (resourcesvc) Manages the following tables: Resource Table (RT) which handles
replication groups, buckets, users, namespace information and so on.
Storage Service Manager (ssm) Manages the Storage Space (SS) table, which contains disk
block usage and disk-to-chunk mapping. Interacts with one or more
Storage Servers and manages the active/free chunks on the
corresponding servers. Directs I/O operations to the disks.
Statistics Service (statsvc) Tracks various information on storage processes. These statistics can
be used to monitor the system.
See Advanced Monitoring, Process Health - by Nodes, Process Health - Overview, and Process
Health - Process List by Node for details.
Monitor transactions
You can monitor requests and network performance for VDCs and nodes from the Monitor >
Transactions page.
Access the Transactions tab from the ECS Portal at Monitor > Transactions.
Note: When you click Transactions, the Data Access Performance - Overview dashboard
opens in a new Grafana window.
The Transactions data can also be accessed from Advanced Monitoring > Data Access
Performance - Overview.
See Advanced Monitoring and Data Access Performance - Overview for details.
Column Description
Replication Group Lists the replication groups of which this VDC is a member. Click a
replication group to see a table of remote VDCs in the replication
group and their statistics. Click the Replication Groups link above the
table to return to the default view.
Write Traffic The current rate of writes to all remote VDCs or individual remote VDC
in the replication group.
Read Traffic The current rate of reads to all remote VDCs or individual remote VDC
in the replication group.
User Data Pending Replication The total logical size of user data waiting for replication for the
replication group or remote VDC.
Metadata Pending Replication The total logical size of metadata waiting for replication for the
replication group or remote VDC.
Data Pending XOR The total logical size of all data waiting to be processed by the XOR
compression algorithm in the local VDC for the replication group or
remote VDC.
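The XOR scheme referenced in the Data Pending XOR row combines chunks from different VDCs so that any one of them can be rebuilt from the XOR result and the surviving chunks. The principle in miniature, with short byte strings standing in for chunks:

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

chunk_a = b"\x01\x02\x03\x04"        # chunk replicated from VDC A
chunk_b = b"\x0a\x0b\x0c\x0d"        # chunk replicated from VDC B
parity = xor_bytes(chunk_a, chunk_b)  # stored in place of both full copies

# If VDC A fails, its chunk is recovered from the parity and chunk B:
recovered = xor_bytes(parity, chunk_b)
print(recovered == chunk_a)  # True
```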
Column Description
Remote Replication Group\Remote VDC At the VDC level, lists all remote replication groups of which the local
VDC is a member. At the replication group level, this column lists the
remote VDCs in the replication group.
Overall RPO The recent time period for which data might be lost in the event of a
local VDC failure.
Field Description
Replication Group Lists the replication groups that the local VDC is a member of.
Failed VDC Identifies a failed VDC that is part of the replication group.
User Data Pending Re-replication When a VDC fails, user data chunks replicated to the failed VDC have
to be re-replicated to a different VDC. This field reports the logical
size of all user data (repository) chunks waiting re-replication to a
different VDC.
Metadata Pending Re-replication When a VDC fails, metadata chunks replicated to the failed VDC have
to be re-replicated to a different VDC. This field reports the logical
size of all metadata chunks waiting re-replication to a different VDC.
Data Pending XOR Decoding Shows the count and total logical size of chunks waiting to be
recovered through XOR decoding.
Failover Progress A percentage indicator for the overall status of the failover process.
Column Description
Replication Group This column provides the list of replication groups of which the local
VDC is a member and that are adding new VDCs. Each row provides
metrics for the specified replication group.
Added VDC The VDC being added to the specified replication group.
User Data Pending Replication The logical size of all user data (repository) chunks waiting for
replication to the new VDC.
Metadata Pending Replication The logical size of all system metadata waiting for replication to the
new VDC.
Bootstrap Progress (%) The completion percent of the entire bootstrap process.
Cloud topology
You can use the Cloud topology summary information to see how the ECS system is making use of
hosted VDCs.
The Cloud > Topology page shows the hosted VDCs that are part of an ECS federated system,
and shows the relationship between the hosted VDC and any on-premise VDCs.
Cloud Hosted VDCs
The Cloud Hosted VDCs table shows the hosted VDCs that are present in the ECS system.
Currently ECS supports a single hosted site.
Related On-Premise VDCs
The Related On-Premise VDCs table shows the on-premise VDCs that are part of the ECS
federation.
Related Replication Groups
The Related Replication Groups table shows the replication groups that contain a storage pool
contributed by a selected hosted VDC. The Hosted VDC is selected in the Cloud Hosted VDC table.
A primary use case for using a hosted VDC is the Passive configuration in which the hosted VDC
provides a site for replication data but cannot be used as an active site by users. However, where
the active operation of the hosted VDC is allowed, the hosted VDC can be included in replication
groups where the type is Passive.
The table shows the replication group type and the VDC storage pools that are contributing to the
replication group, at least one of which will be a hosted VDC.
Attribute Description
Read Latency The average latency in milliseconds for reads from all replication groups
associated with the selected VDC.
Write Latency The average latency in milliseconds for writes to all replication groups
associated with the selected VDC.
Read Bandwidth The bandwidth utilized by reads from all replication groups associated with
the selected VDC.
Write Bandwidth The bandwidth utilized by writes from all replication groups associated with
the selected VDC.
Replication Groups
The Replication Groups tab shows each replication group and provides traffic data for a VDC for
each replication group that it contributes to. A VDC might have a storage pool that is in more than
one replication group, and this display allows you to see the traffic associated with each replication
group.
Attribute Description
Read Latency The average latency in milliseconds for reads from the selected VDC that
relate to the specified replication group.
Write Latency The average latency in milliseconds for writes to the selected VDC that
relate to the specified replication group.
Read Bandwidth The bandwidth utilized by reads from the selected VDC that relate
to the specified replication group.
Write Bandwidth The bandwidth utilized by writes to the selected VDC that relate to the
specified replication group.
Audit messages
List of the audit messages generated by ECS.
Fabric InstallerServiceOperation[kind=
INSTALLER_SERVICE_
OPERATION,
host=${hostName},
timestamp=${timestamp},
operationType=${operation},
args=${arguments of operation},
status=SUCCEEDED,
fqdn=${fqdn of host},
version=${installer version}]
Fabric NodeMaintenanceMode[kind=
NodeMaintenanceMode,
timestamp=${timestamp},
agentId=${agendId},
fqdn=${fqdn},
status=${MaintenanceStatus}]
Local user domain_group_mapping_updated_no_roles All roles of domain group ${resourceId}
mapping have been removed
Local user domain_user_mapping_updated_no_roles All roles of domain user ${resourceId}
mapping have been removed
Local user local_user_created_no_roles Management user ${resourceId} without roles has been
created
Local user local_user_roles_updated_no_roles All roles of management user ${resourceId} have been
removed
NFS export_deleted Export with export path ${exportPath} has been deleted
User user_set_password New password has been set for object user ${resourceId}
User user_set_metadata New metadata has been set for object user ${resourceId}
User user_set_user_tag User Tag has been set for object user ${resourceId}
User user_delete_user_tag User Tag has been deleted for object user ${resourceId}
Monitor alerts
You can use the Monitor > Events > Alerts tab to view and manage system alerts.
About this task
See the list of alert messages.
Alert message Severity labels have the following meanings:
l Critical: Messages about conditions that require immediate attention
l Error: Messages about error conditions that report either a physical failure or a software failure
l Warning: Messages about less than optimal conditions
l Info: Routine status messages
Procedure
1. Select Alerts.
2. Optionally, click Filter.
3. Select your filters. The alerts filter adds filtering by Severity and Type, and an option to
Show Acknowledged Alerts, which retains the display of an alert even after it is
acknowledged by the user. When creating a custom date-time range, select Current Time
to use the current date and time as the end of your range.
Alert types must be entered exactly as described in the following table:
Quota                       Raised when soft or hard quota limits are exceeded (SoftQuotaLimitExceeded or HardQuotaLimitExceeded) for a bucket or for a namespace.
RPO                         Raised when the recovery point objective (RPO) is greater than the RPO threshold.
Capacity Alerting           Raised when the remaining capacity of the storage pool reaches a set threshold.
Capacity License Threshold  Raised if the system capacity is greater than the licensed capacity.
TestDialHome                Raised to test that ESRS connections can be established and that the call home functionality works.
4. Select a Namespace.
5. Click Apply.
6. Next to each event, click the Acknowledge Alert button to acknowledge and dismiss the
message. Messages that have previously been acknowledged are displayed when the Show
Acknowledged Alerts filter option is selected, but the Acknowledge Alert button is not
displayed for these rows.
7. When the Description of an alert is formatted as a link, you can click it to go to a
relevant page in the portal.
Alert policy
Alert policies are created to alert about metrics, and are triggered when the specified conditions
are met. Alert policies are created per VDC.
You can use the Settings > Alerts Policy page to view alert policies.
There are two types of alert policy:
System alert policies
- System alert policies are precreated and exist in ECS during deployment.
- All the metrics have an associated system alert policy.
- System alert policies cannot be updated or deleted.
- System alert policies can be enabled or disabled.
- Alerts are sent to the UI and all channels (SNMP, SYSLOG, and Secure Remote Services).
8. Select conditions.
You can set the threshold values and alert type with Conditions. An alert can be a
Warning Alert, Error Alert, or Critical Alert.
9. To add more conditions with multiple thresholds and different alert levels, select Add
Condition.
10. Click Save.
To keep a record of the acknowledge-all-alerts request, a new informational alert of type
Bulk Alert Ack is generated after the acknowledgment completes. Clear the filter and
manually refresh the table.
Alert messages
List of the alert messages that ECS uses.
Alert message Severity labels have the following meanings:
- Critical: Messages about conditions that require immediate attention
- Error: Messages about error conditions that report either a physical failure or a software failure
- Warning: Messages about less-than-optimal conditions
- Info: Routine status messages
Btree chunk level GC
Severity: Warning (1321)
Channels: Portal, API, Secure Remote Services, SNMP Trap, Syslog
Message: System metadata garbage reclamation throughput is too slow to catch up with garbage detection.
Event trigger source:
- Example: Reclaimed Btree Garbage is less than 10% of the remaining BTree garbage, as BTree GC is slow at chunk reclamation.
Action: Contact ECS Remote Support.
Btree disk level GC
Severity: Warning (1325)
Channels: Portal, API, Secure Remote Services, SNMP Trap, Syslog
Message: Capacity free-up throughput is too slow to catch up with system metadata garbage reclamation.
Event trigger source:
- Example: Reclaimed Btree Garbage is less than 10% of the Full garbage, as BTree GC is slow at disk level reclamation.
- This condition has persisted for the last 7 days, leading to creation of this alert.
- Derived from formula: if Garbage_Pending_Delete > 1TB, and Garbage_Chunk_Reclaim_Rate - Garbage_Capacity_Reclaim_Rate > 100GB
Action: Contact ECS Remote Support.
Btree partial GC
Severity: Warning (1329)
Channels: Portal, API, Secure Remote Services, SNMP Trap, Syslog
Message: Partial GC for system metadata is too slow.
Event trigger source:
- Example: Rate of Btree Partial GC conversion to full Garbage is less than 10% of the Partial GC eligible for Conversion.
- Btree partial GC works too slowly to convert partial garbage into full garbage.
- This condition has persisted for the last 7 days, leading to creation of this alert.
- Derived from formula: if Partial_Eligible_Garbage > 1TB, and Partial_To_Full_Convert_Rate < 100GB
Action: Contact ECS Remote Support.
Capacity alerting
Severity: Warning (1111), Error (1112), Critical (1113)
Channels: Portal, API, SNMP Trap, Syslog
Message: Storage pool {Storage pool} has {id}% remaining capacity, meeting the threshold of {id}%.
Description: The severity of the alert depends on how close the remaining storage pool capacity is to reaching the configured threshold. Capacity alerting is not set by default: set capacity alerts to receive them. You can set them by editing an existing storage pool or when you create a storage pool.
Capacity exceeded threshold
Severity: Warning (1100)
Channels: Portal, API, Secure Remote Services, SNMP Trap, Syslog
Message: Used Capacity of the VDC exceeded the configured threshold; current usage is {usage}%.
Description: The configured threshold is set at 80% of the Used Capacity of the VDC by default. CAUTION: If the used capacity reaches 90%, you cannot write or modify object data.
Action: Contact an ECS Remote Support representative to determine the appropriate solution.
Capacity license threshold
Severity: Error (997)
Channels: Portal, API, Secure Remote Services, SNMP Trap, Syslog
Message: Licensed Capacity Entitlement Exceeded Event
Description: The capacity of the system is greater than was licensed.
CPU Usage Percent
Severity: Warning (4001), Error (4002), Critical (4003)
Channels: Portal, API, SNMP Trap, Syslog
Message: CPU usage is ${inspectorValue}%, crosses threshold ${thresholdValue}%
Description: If the CPU usage percentage crosses the specified threshold, the alert is triggered.
Data Migration Blocked
Severity: Error (1500)
Channels: Portal, ESRS, SNMP Trap, Syslog, SMTP
Message: Data Migration has no movement for ${configured} hours for a device and level (default 6 hours).
Description: Data migration has made no progress for several hours.
Note: Ignore the severity of Warning for the Data Migration Finished alert. The severity is supposed to be Info.
Disabled CAS GC
Severity: Info (1316), Warning (1317), Error (1318), Critical (1319)
Channels: Portal, API, Secure Remote Services, SNMP Trap, Syslog
Message: CAS Processing is paused.
Description:
- CAS GC is Content Addressable Storage Garbage Collection.
- CAS GC is disabled.
Action: Contact an ECS Remote Support representative to determine the appropriate solution.
First Byte Latency For Read
Severity: Warning (4009), Error (4010), Critical (4011)
Channels: Portal, API, SNMP Trap, Syslog
Message: First Byte Latency for Read is ${inspectorValue}ms, crosses threshold ${thresholdValue}ms
Description: If TTFB for read latency crosses the specified threshold, the alert is triggered.
Last Byte Latency For Write
Severity: Warning (4003), Error (4014), Critical (4015)
Channels: Portal, API, SNMP Trap, Syslog
Message: Last Byte Latency for Write is ${inspectorValue}ms, crosses threshold ${thresholdValue}ms
Description: If TTLB for write latency crosses the specified threshold, the alert is triggered.
Examples:
- Read latency is 1050 milliseconds, crosses threshold 1000 milliseconds.
- Write latency is 1500 milliseconds, crosses threshold 1000 milliseconds.
Process memory table free space percent
Severity: Error (1354)
Channels: Portal, API, Secure Remote Services, SNMP Trap, Syslog
Message: Memory table size for blob process is X% less than the specified threshold of Y% on <node IP>.
Action: Contact ECS Remote Support.
Repo chunk level GC
Severity: Warning (1333)
Channels: Portal, API, Secure Remote Services, SNMP Trap, Syslog
Message: User garbage collection throughput is too slow to catch up with garbage detection.
Event trigger source:
- Example: Repo Chunk reclamation rate is less than 10% of the remaining garbage.
- This condition has persisted for the last 7 days, leading to creation of this alert.
- Derived from formula: Full_Garbage > 10TB, and Garbage_Detected_Rate - Garbage_Chunk_Reclaim_Rate > 100GB
Action: Contact ECS Remote Support.
Repo disk level GC
Severity: Warning (1337)
Channels: Portal, API, Secure Remote Services
Message: Capacity free-up throughput is too slow to catch up.
Event trigger source:
- Example: Repo disk level GC
Action: Contact ECS Remote Support.
Repo partial GC
Severity: Warning (1341)
Channels: Portal, API, Secure Remote Services, SNMP Trap, Syslog
Message: Partial GC for user garbage is too slow.
Event trigger source:
- Example: Repo partial GC works too slowly to convert partial garbage into full garbage.
- This condition has persisted for the last 7 days, leading to creation of this alert.
- Derived from formula: if Partial_Eligible_Garbage > 10TB, and Partial_To_Full_Convert_Rate < 100GB
Action: Contact ECS Remote Support.
RPO
Severity: Warning (1012)
Channels: Portal, API, Secure Remote Services, SNMP Trap, Syslog
Message: RPO for replication group {RG} is {HH} hour {SS} seconds greater than the {HH} hour threshold set.
Description: The recovery point objective (RPO) is greater than the RPO threshold. The default value is one hour.
Slow CAS GC Object Cleanup
Severity: Info (1312), Warning (1313), Error (1314), Critical (1315)
Channels: Portal, API, Secure Remote Services, SNMP Trap, Syslog
Message: CAS Processing object cleanup speed is slow.
Description: CAS GC cleanup tasks are lagging.
Slow CAS GC Reference Collection
Severity: Info (1308), Warning (1309), Error (1310), Critical (1311)
Channels: Portal, API, Secure Remote Services, SNMP Trap, Syslog
Message: CAS Processing reference collection speed is slow.
Description: CAS GC reference collection tasks are lagging.
Slow Journal Parsing
Severity: Info (1304), Warning (1305), Error (1306), Critical (1307)
Channels: Portal, API, Secure Remote Services, SNMP Trap, Syslog
Message: Journal parsing speed is slow.
Description: Journal parsing speed is slow.
Space Usage Percent
Severity: Warning (4005), Error (4006), Critical (4007)
Channels: Portal, API, SNMP Trap, Syslog
Message: Disk space usage is ${inspectorValue}%, crosses threshold ${thresholdValue}%
Description: If the disk usage percentage crosses the specified threshold, the alert is triggered.
SSD Read Cache Capacity Failure
Severity: Error (1392)
Channels: Portal, API, Secure Remote Services, SNMP Trap, Syslog
Message: SSD read cache auto cleanup failed when capacity full and fall back to memory cache.
Description: SSD read cache falls back to memory cache after cleanup fails when capacity is full.
Disk Ready for Replacement
Severity: Info (2061)
Channels: Portal, API, SNMP Trap, Syslog, Secure Remote Services
Message: Node SN={node sn} Disk SN=${disk sn} in rack={rack}, node={fqdn}, slot={slot number} is ready for replacement. Disk Details: Type={disk type}, Model={vendor model}, Size={disk size} GB, Firmware=${firmware version}.
Description: A disk with SUSPECT/BAD health is no longer used by the object service, is unmounted, and is ready to be replaced.
Disk Failed Replace Process
Severity: Error (2062)
Channels: Portal, API, SNMP Trap, Syslog, Secure Remote Services
Message: Node SN={node sn} Disk SN={diskSerialNumber} in rack={rack}, node={fqdn}, slot={slot} cannot be removed. Disk Details: Type={disk type}, Model={Vendor Model}, Size={size} GB, Firmware={firmware}, reason: {reason}
Description: A disk started to have SUSPECT/BAD health and Fabric started the process to remove that disk from usage, but something went wrong.
Disk added
Severity: Info (2019)
Channels: Portal, API, SNMP Trap, Syslog
Message: Disk {diskSerialNumber} on node {fqdn} was added.
Description: Disk was added.
Disk failure
Severity: Critical (2002)
Channels: Portal, API, SNMP Trap, Syslog, Secure Remote Services
Message: Disk SN={diskSerialNumber} on rack={rack}, node={fqdn}, slot={slot number} has FAILED. Disk Details: Type={disk type}, Model='{VID PID}', Size='{disk size} GB', Firmware={firmware version}
Description: The health of the disk changed to BAD.
Disk good
Severity: Info (2025)
Channels: Portal, API, SNMP Trap, Syslog
Message: Disk {diskSerialNumber} on node {fqdn} was revived.
Description: Disk was revived.
Disk suspect
Severity: Error (2003)
Channels: Portal, API, SNMP Trap, Syslog, Secure Remote Services
Message: Disk SN={diskSerialNumber} on rack={rack}, node={fqdn}, slot={slot number} has SUSPECTED. Disk Details: Type={disk type}, Model='{VID PID}', Size='{disk size} GB', Firmware={firmware version}
Description: The health of the disk changed to SUSPECT.
Message: Disk {diskSerialNumber} on node {fqdn} has unmounted.
Firewall health is BAD or SUSPECT
Severity: Bad (2051), Suspect (2052)
Channels: Portal, API, Secure Remote Services, SNMP Trap, Syslog
Message (2051): Firewall health is BAD! {reason}
Description: Rules or ip sets do not exist, the system firewall is off, or ip tables or ip set utils do not exist.
Message (2052): Firewall health is SUSPECT! {reason}
Description: Rules or ip sets do not exist; trying to recover.
Fabric agent suspect
Severity: Error (2014)
Channels: Portal, API, SNMP Trap, Syslog
Message: FabricAgent has suspected on node {fqdn}.
Description: Fabric agent health is suspect.
Net interface down
Severity: Critical (2023)
Channels: Portal, API, SNMP Trap, Syslog
Message: Net interface {$netInterfaceName}[ on node $FQDN] is down.
Description: Fabric's net interface is down.
Net interface permanent down
Severity: Critical (2026)
Channels: Portal, API, Secure Remote Services
Message: Net interface {$netInterfaceName}[ on node $FQDN] is permanently down[ with IP address $IP].
Description: Net interface is down for at least 10 minutes.
Net interface IP address updated
Severity: Info (2027)
Channels: Portal, API, SNMP Trap, Syslog
Message: Net interface's {netInterfaceName} IP address on node {fqdn} was changed to {newIpAddress}.
Description: Fabric's net interface IP address changed.
Node failure
Severity: Critical (2006)
Channels: Portal, API, SNMP Trap, Syslog, Secure Remote Services
Message: Node {fqdn} has failed.
Description: Node is not reachable for 30 minutes.
Node up
Severity: Info (2018)
Channels: Portal, API, SNMP Trap, Syslog
Message: Node {fqdn} is up.
Description: Node moved to the 'up' state after it was down for at least 15 minutes.
TestDialHome
Severity: N/A
Channels: Secure Remote Services
Message: TestDialHome
Description: Tests that Secure Remote Services connections can be established and that the call home functionality works.
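The garbage-collection alerts above are driven by simple threshold formulas, for example "Partial_Eligible_Garbage > 1TB, and Partial_To_Full_Convert_Rate < 100GB". The sketch below paraphrases two of the documented formulas in Python for illustration only; it is not ECS source code, and it omits the additional requirement that the condition persist for 7 days before the alert is created.

```python
# Illustrative paraphrase of two documented GC alert formulas; not ECS code.
TB = 1024 ** 4  # tebibyte in bytes
GB = 1024 ** 3  # gibibyte in bytes

def btree_partial_gc_alert(partial_eligible_garbage: int,
                           partial_to_full_convert_rate: int) -> bool:
    """Btree partial GC (1329): eligible partial garbage piles up faster
    than it is converted into full garbage."""
    return (partial_eligible_garbage > 1 * TB and
            partial_to_full_convert_rate < 100 * GB)

def repo_chunk_gc_alert(full_garbage: int,
                        garbage_detected_rate: int,
                        garbage_chunk_reclaim_rate: int) -> bool:
    """Repo chunk level GC (1333): garbage detection outpaces reclamation."""
    return (full_garbage > 10 * TB and
            garbage_detected_rate - garbage_chunk_reclaim_rate > 100 * GB)

# 2 TB of eligible partial garbage converting at only 50 GB triggers the alert.
print(btree_partial_gc_alert(2 * TB, 50 * GB))            # True
print(repo_chunk_gc_alert(11 * TB, 300 * GB, 150 * GB))   # True
```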
- Advanced Monitoring
- Flux API
- Dashboard APIs
Advanced Monitoring
Advanced Monitoring dashboards provide critical information about the ECS processes on the VDC
you are logged in to. The advanced monitoring dashboards are based on a time series database
and are provided by Grafana, a well-known open-source time series analytics platform.
Refer to the Grafana documentation for basic details of navigating Grafana dashboards.
Data Access Performance - Overview
    You can use the Data Access Performance - Overview dashboard to monitor VDC data.
Data Access Performance - by Namespaces
    You can use the Data Access Performance - by Namespaces dashboard to monitor performance data for an individual namespace or group of namespaces.
Data Access Performance - by Nodes
    You can use the Data Access Performance - by Nodes dashboard to see performance data for an individual node or group of nodes in a VDC.
Data Access Performance - by Protocols
    You can use the Data Access Performance - by Protocols dashboard to see performance data for each supported protocol (S3, ATMOS, SWIFT) or set of protocols.
Disk Bandwidth - by Nodes
    You can use the Disk Bandwidth - by Nodes dashboard to monitor the disk usage metrics by read or write operations at the node level. The dashboard displays the latest values.
Disk Bandwidth - Overview
    You can use the Disk Bandwidth - Overview dashboard to monitor the disk usage metrics by read or write operations at the VDC level.
Process Health - by Nodes
    You can use the Process Health - by Nodes dashboard to monitor, for each node of the VDC, the use of network interface, CPU, and available memory. The dashboard displays the latest values, and the history graphs display values in the selected range.
Process Health - Overview
    You can use the Process Health - Overview dashboard to monitor the VDC use of network interface, CPU, and available memory. The dashboard displays the latest values.
Process Health - Process List by Node
    You can use the Process Health - Process List by Node dashboard to monitor each process's use of CPU and memory, its average thread number, and its last restart time in the selected time range. The dashboard displays the latest values in the selected time range.
Recovery Status
    You can use the Recovery Status dashboard to monitor the data recovered by the system.
SSD Read Cache
    You can use the SSD Read Cache dashboard to monitor total SSD disk capacity and the disk space that is used by the SSD read cache.
Tech Refresh: Data Migration
    You can use the Tech Refresh: Data Migration dashboard to monitor the data migration off and onto a node or cluster.
Top Buckets
    You can use the Top Buckets dashboard to monitor the buckets with top utilization, based on total object size and count.
Transaction Summary
    Dashboards: Data Access Performance - Overview, by Namespaces, by Nodes, by Protocols
    Lists the total Successful requests, System Failures, User Failures, and Failure % Rate for the selected VDCs, namespaces, nodes, or protocols.
Performance Summary
    Dashboards: Data Access Performance - Overview, by Nodes
    Lists the latest values of data access bandwidth and latency of read/write requests for the selected range.
Successful requests
    Dashboards: Data Access Performance - Overview, by Namespaces, by Nodes, by Protocols
    The number of data requests that were successfully completed.
System Failures
    Dashboards: Data Access Performance - Overview, by Namespaces, by Nodes, by Protocols
    The number of data requests that failed due to hardware or service errors. System failures are failed requests that are associated with hardware or service errors (typically an HTTP error code of 5xx).
User Failures
    Dashboards: Data Access Performance - Overview, by Namespaces, by Nodes, by Protocols
    The number of data requests from all object heads that are classified as user failures. User failures are known error types originating from the object heads (typically an HTTP error code of 4xx).
Failure % Rate
    Dashboards: Data Access Performance - Overview, by Namespaces, by Nodes, by Protocols
    The percentage of failures for the VDC, namespace, nodes, or protocols.
TPS (success/failure)
    Dashboards: Data Access Performance - Overview, by Namespaces, by Nodes, by Protocols
    Rate of successful requests and failures per second.
Failed Requests/s by error type (user/system)
    Dashboards: Data Access Performance - Overview, by Namespaces, by Nodes, by Protocols
    Rate of failed requests per second, split by error type (user/system).
Successful request drill down
    Dashboards: Data Access Performance - Overview, by Nodes, by Protocols
    Displays the rate of successful requests per second, by method, node, and protocol.
Failures drill down
    Dashboards: Data Access Performance - Overview, by Nodes, by Protocols
    Displays the rate of failed requests per second, by method, node, and protocol.
Failed Requests/s by error code
    Dashboards: Data Access Performance - Overview, by Nodes, by Protocols
    Rate of failed requests per second, by error code.
Compare TPS of failed requests
    Dashboard: Data Access Performance - by Namespaces
    Select multiple namespaces and compare rates of failed requests per second, by error type (user/system).
Compare read bandwidth
    Dashboards: Data Access Performance - by Nodes, by Protocols
    Select multiple nodes and compare data access bandwidth (read) of successful requests per second.
Compare write bandwidth
    Dashboards: Data Access Performance - by Nodes, by Protocols
    Select multiple nodes and compare data access bandwidth (write) of successful requests per second.
Compare read latency
    Dashboards: Data Access Performance - by Nodes, by Protocols
    Select multiple nodes and compare the latency of read requests.
Compare write latency
    Dashboards: Data Access Performance - by Nodes, by Protocols
    Select multiple nodes and compare the latency of write requests.
Compare rate of failed requests/s
    Dashboards: Data Access Performance - by Nodes, by Protocols
    Select multiple nodes and compare rates of failed requests per second, split by error type (user/system).
Request drill down by nodes
    Dashboard: Data Access Performance - by Namespaces
    Rate of requests per second, split by node.
Read or Write
    Dashboards: Disk Bandwidth - by Nodes, Disk Bandwidth - Overview
    Indicates whether the row describes read data or write data.
Nodes
    Dashboards: Disk Bandwidth - by Nodes, Disk Bandwidth - Overview
    The number of nodes in the VDC. You can click the nodes number to see the disk bandwidth metrics for each node. There is no Nodes column when you have drilled down into the Nodes display for a VDC.
Total
    Dashboards: Disk Bandwidth - by Nodes, Disk Bandwidth - Overview
    Total disk bandwidth that is used for either read or write operations.
XOR
    Dashboards: Disk Bandwidth - by Nodes, Disk Bandwidth - Overview
    Rate at which disk bandwidth is used in the XOR data protection operations of the system. XOR operations occur for systems with three or more sites (VDCs).
Consistency Checker
    Dashboards: Disk Bandwidth - by Nodes, Disk Bandwidth - Overview
    Rate at which disk bandwidth is used to check for inconsistencies between protected data and its replicas.
Geo
    Dashboards: Disk Bandwidth - by Nodes, Disk Bandwidth - Overview
    Rate at which disk bandwidth is used to support geo replication operations.
User Traffic
    Dashboards: Disk Bandwidth - by Nodes, Disk Bandwidth - Overview
    Rate at which disk bandwidth is used by object users.
Pending Rebalancing
    Dashboard: Node Rebalancing
    Amount of data that is in the rebalance queue but has not been rebalanced yet.
Rate of Rebalance (per day)
    Dashboard: Node Rebalancing
    The incremental amount of data that was rebalanced during a specific time period. The default time period is one day.
Process Restarts
    Dashboard: Process Health - Process List by Node
    The last time the process restarted on the node in the selected time range. The maximum time range can be 5 days because it is limited by the retention policy.
Avg. CPU Usage
    Dashboard: Process Health - Overview
    Average percentage of the CPU hardware that is used by the selected VDC or node.
CPU Usage
    Dashboards: Process Health - by Nodes, Process Health - Process List by Node
    Percentage of the node's CPU used by the process. The list of processes that are tracked is not the complete list of processes running on the node, so the sum of the CPU used by the tracked processes is not equal to the CPU usage shown for the node.
Avg. # Thread
    Dashboard: Process Health - Process List by Node
    Average number of threads used by the process.
Last Restart
    Dashboard: Process Health - Process List by Node
    The last time the process restarted on the node.
Amount of Data to be Recovered
    Dashboard: Recovery Status
    With the Current filter selected, this is the logical size of the data yet to be recovered.
    - When a historical period is selected as the filter, Total Amount Data to be Recovered means the average amount of data pending recovery during the selected time.
    - For example, if the first hourly snapshot of the data showed 400 GB of data to be recovered in a historical time period and every other snapshot showed 0 GB waiting to be recovered, the value of this field would be 400 GB divided by the total number of hourly snapshots in the period.
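The historical averaging described above is a simple mean over the hourly snapshots. A short illustrative sketch (not ECS code):

```python
# Average "data to be recovered" across hourly snapshots, as described above.
def average_pending_recovery(snapshots_gb: list) -> float:
    """Return the average amount of data (GB) pending recovery over all
    hourly snapshots in the selected historical period."""
    return sum(snapshots_gb) / len(snapshots_gb)

# One snapshot shows 400 GB pending and the remaining 23 snapshots of the
# day show 0 GB, so the field reports 400 / 24, roughly 16.7 GB.
snapshots = [400.0] + [0.0] * 23
print(round(average_pending_recovery(snapshots), 1))  # 16.7
```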
Disk Usage
    Dashboard: SSD Read Cache
    SSD space that is used by the read cache.
Remaining Volume to Migrate
    Dashboard: Tech Refresh: Data Migration
    This panel shows a graph of the remaining volume on source nodes.
Migration Speed
    Dashboard: Tech Refresh: Data Migration
    This panel shows a graph of the remaining volume on source nodes.
Time of Calculation
    Dashboard: Top Buckets
    The time at which the displayed metrics of the Top Buckets dashboard were calculated.
View mode
Procedure
1. To view a dashboard in the view mode, click the title of a dashboard, for example TPS
(success/failure) > View.
The dashboard opens in the view mode or in the full-screen mode.
2. Click the Back to dashboard icon to return to the dashboards view.
Export CSV
Procedure
1. To export the dashboard data to .csv format, click the title of a dashboard, for example
TPS (success/failure) > More > Export CSV.
The Export CSV window pops up.
You can customize the CSV output by modifying the Mode and Date Time Format attributes
and by checking or unchecking the Excel CSV Dialect attribute.
2. Click Export > Save to export the dashboard data in .csv format to your local storage.
- TPS (success/failure)
- Bandwidth (read/write)
- Failed Requests/s by error type (user/system)
- Latency
- Successful Requests/s by Node
- Failed Requests/s by Node
- Compare TPS of successful requests
- Compare TPS of failed requests
- Compare read bandwidth
- Compare write bandwidth
- Compare read latency
- Compare write latency
To view the Data Access Performance - by Protocols dashboard in the ECS Portal, select Advanced
Monitoring > Related dashboards > Data Access Performance - by Protocols.
Data for all the protocols is visible in the default view. To select data for a protocol, click the
legend parameter for the protocol below the graph.
Requests drill down by nodes shows the successful and failed requests by node.
Compare: selecting multiple namespaces compares TPS of successful and failed requests,
read/write bandwidth, and read/write latency.
Node Rebalancing
You can use the Node Rebalancing dashboard to monitor the status of data rebalancing
operations when nodes are added to, or removed from, a cluster. Node rebalancing is enabled by
default at installation. Contact your customer support representative to disable or re-enable this
feature.
To view the Node Rebalancing dashboard, click Advanced Monitoring > expand Data Access
Performance - Overview > Node Rebalancing.
A series of interactive graphs shows the amount of data rebalanced, the amount pending
rebalancing, and the rate of rebalancing data in bytes over time.
Node rebalancing works only for new nodes that are added to the cluster.
Recovery Status
You can use the Recovery Status dashboard to see:
- The latest value of the logical size of the data yet to be recovered in the selected time range, and
- The history of the amount of data that is pending recovery in the selected time range.
To view the Recovery Status dashboard, click Advanced Monitoring > expand Data Access
Performance - Overview > Recovery Status.
Top Buckets
ECS metering includes a mechanism to calculate the buckets with top utilization, based on total
object size and count.
Statistics of the buckets with top utilization for the system are displayed in monitoring
dashboards. The number of buckets that are displayed on the monitoring dashboard is a
configurable value.
To view the Top Buckets dashboard, click Advanced Monitoring > expand Data Access
Performance - Overview > Top Buckets.
Flux API
The Flux API enables you to retrieve time series database data by sending REST queries using
curl. You can get raw data from the fluxd service in a way similar to using the Dashboard API.
You must get a token and provide the token in the requests.
Before you begin
This operation requires one of the following roles:
- SYSTEM_ADMIN
- SYSTEM_MONITOR
Request payload examples
JSON body:
{
  "query": "from(bucket:\"monitoring_main\") |> range(start: -30m) |> filter(fn: (r) => r._measurement == \"statDataHead_performance_internal_transactions\")"
}
Form-encoded body:
query=from(bucket: "monitoring_main")
  |> range(start: -30m)
  |> filter(fn: (r) => r._measurement == "statDataHead_performance_internal_transactions")
Procedure
1. Generate a token.
Token
JSON example
"string",
"string",
"string",
"string",
"string",
"string"
],
"Columns": [
"table",
"_start",
"_stop",
"_time",
"_value",
"_field",
"_measurement",
"host",
"node_id",
"process",
"tag"
],
"Values": [
[
"0",
"2020-03-10T09:54:31.207799855Z",
"2020-03-10T10:24:31.207799855Z",
"2020-03-10T09:56:43Z",
"1",
"failed_request_counter",
"statDataHead_performance_internal_transactions",
"ecs.lss.emc.com",
"28cd473e-ca45-4623-b30d-0481c548a650",
"statDataHead",
"dashboard"
],
[
"0",
"2020-03-10T09:54:31.207799855Z",
"2020-03-10T10:24:31.207799855Z",
"2020-03-10T10:01:43Z",
"1",
"failed_request_counter",
"statDataHead_performance_internal_transactions",
"ecs.lss.emc.com",
"28cd473e-ca45-4623-b30d-0481c548a650",
"statDataHead",
"dashboard"
],
[
"0",
"2020-03-10T09:54:31.207799855Z",
"2020-03-10T10:24:31.207799855Z",
"2020-03-10T10:06:43Z",
"1",
"failed_request_counter",
"statDataHead_performance_internal_transactions",
"ecs.lss.emc.com",
"28cd473e-ca45-4623-b30d-0481c548a650",
"statDataHead",
"dashboard"
],
CSV example
Information:
Measurements in this section have the following structure: the measurement name starts with
the name of the ECS service that produces it, followed by the statistic name.
Service is the name of the ECS service that produces the measurement, for example blob,
cm, georcv, statDataHead.
For example:
blob_IO_Statistics_data_read
cm_IO_Statistics_data_write
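Given that naming convention, the producing service can be read off a measurement name by splitting on the first underscore. A small illustrative helper, not part of ECS:

```python
# Extract the ECS service prefix from a measurement name.
def service_of(measurement: str) -> str:
    """Return the service portion of a measurement name, for example
    'blob_IO_Statistics_data_read' gives 'blob'."""
    return measurement.split("_", 1)[0]

print(service_of("blob_IO_Statistics_data_read"))   # blob
print(service_of("cm_IO_Statistics_data_write"))    # cm
```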
Measurement: blob_IO_Statistics_data_read
...
Tags: host, node_id, process, tag
Fields: read_CCTotal (float, bytes)
read_ECTotal (float, bytes)
read_GEOTotal (float, bytes)
read_RECOVERTotal (float, bytes)
read_USERTotal (float, bytes)
read_XORTotal (float, bytes)
Measurement: blob_IO_Statistics_data_write
...
Tags: host, node_id, process, tag
Fields: write_CCTotal (integer)
write_ECTotal (integer)
write_GEOTotal (integer)
write_RECOVERTotal (integer)
write_USERTotal (integer)
write_XORTotal (integer)
Measurement: blob_SSDReadCache_Stats
Tags: host, id, last, node_id, process
Fields: +Inf (integer)
0.0 (integer)
1000.0 (integer)
25000.0 (integer)
5000.0 (integer)
rocksdb_disk_capacity_failure_counter (integer)
rocksdb_disk_usage_counter_bytes (integer)
rocksdb_disk_usage_percentage_counter (integer)
ssd_capacity_counter_bytes (integer)
CM statistics
These statistics represent processes in the ECS CM service, such as BTree GC, chunk
management, and erasure coding.
Measurement: cm_BTREE_GC_Statistics
Tags: host, node_id, process, tag
Fields: accumulated_candidate_garbage_btree_gc_level_0 (integer)
accumulated_candidate_garbage_btree_gc_level_1 (integer)
accumulated_detected_data_btree_level_0 (integer)
accumulated_detected_data_btree_level_1 (integer)
accumulated_reclaimed_data_btree_level_0 (integer)
accumulated_reclaimed_data_btree_level_1 (integer)
candidate_chunks_btree_gc_level_0 (integer)
candidate_chunks_btree_gc_level_1 (integer)
candidate_garbage_btree_gc_level_0 (integer)
candidate_garbage_btree_gc_level_1 (integer)
copy_candidate_chunks_btree_gc_level_0 (integer)
copy_candidate_chunks_btree_gc_level_1 (integer)
copy_completed_chunks_btree_gc_level_0 (integer)
copy_completed_chunks_btree_gc_level_1 (integer)
copy_waiting_chunks_btree_gc_level_0 (integer)
copy_waiting_chunks_btree_gc_level_1 (integer)
deleted_chunks_btree_level_0 (integer)
deleted_chunks_btree_level_1 (integer)
deleted_data_btree_level_0 (integer)
deleted_data_btree_level_1 (integer)
full_reclaimable_chunks_btree_gc_level_0 (integer)
full_reclaimable_chunks_btree_gc_level_1 (integer)
reclaimed_data_btree_level_0 (integer)
reclaimed_data_btree_level_1 (integer)
usage_between_0%_and_5%_chunks_btree_gc_level_0 (integer)
usage_between_0%_and_5%_chunks_btree_gc_level_1 (integer)
usage_between_10%_and_15%_chunks_btree_gc_level_0 (integer)
usage_between_10%_and_15%_chunks_btree_gc_level_1 (integer)
usage_between_5%_and_10%_chunks_btree_gc_level_0 (integer)
usage_between_5%_and_10%_chunks_btree_gc_level_1 (integer)
verification_waiting_chunks_btree_gc_level_0 (integer)
verification_waiting_chunks_btree_gc_level_1 (integer)
Measurement: cm_Chunk_Statistics
Tags: host, node_id, process, tag
Fields: chunks_copy (integer)
chunks_copy_active (integer)
chunks_copy_s0 (integer)
chunks_level_0_btree (integer)
chunks_level_0_btree_active (integer)
chunks_level_0_btree_active_index_page (integer)
chunks_level_0_btree_active_leaf_page (integer)
chunks_level_0_btree_index_page (integer)
chunks_level_0_btree_leaf_page (integer)
chunks_level_0_btree_s0 (integer)
chunks_level_0_btree_s0_index_page (integer)
chunks_level_0_btree_s0_leaf_page (integer)
chunks_level_0_journal (integer)
chunks_level_0_journal_active (integer)
chunks_level_0_journal_s0 (integer)
chunks_level_1_btree (integer)
chunks_level_1_btree_active (integer)
chunks_level_1_btree_active_index_page (integer)
chunks_level_1_btree_active_leaf_page (integer)
chunks_level_1_btree_index_page (integer)
chunks_level_1_btree_leaf_page (integer)
chunks_level_1_btree_s0 (integer)
chunks_level_1_btree_s0_index_page (integer)
chunks_level_1_btree_s0_leaf_page (integer)
chunks_level_1_journal (integer)
chunks_level_1_journal_active (integer)
chunks_level_1_journal_s0 (integer)
chunks_repo (integer)
chunks_repo_active (integer)
chunks_repo_s0 (integer)
chunks_typeII_ec_pending (integer)
chunks_typeI_ec_pending (integer)
chunks_undertransform_ec_pending (integer)
chunks_xor (integer)
data_copy (integer)
data_level_0_btree (integer)
data_level_0_btree_index_page (integer)
data_level_0_btree_leaf_page (integer)
data_level_0_journal (integer)
data_level_1_btree (integer)
data_level_1_btree_index_page (integer)
data_level_1_btree_leaf_page (integer)
data_level_1_journal (integer)
data_repo (integer)
data_repo_copy (integer)
data_xor (integer)
data_xor_shipped (integer)
Measurement: cm_EC_Statistics
Tags: host, node_id, process, tag
Fields: chunks_ec_encoded (integer)
chunks_ec_encoded_alive (integer)
data_ec_encoded (integer)
data_ec_encoded_alive (integer)
Measurement: cm_Geo_Replication_Statistics_Geo_Chunk_Cache
Tags: host, node_id, process, tag
Fields: Capacity_of_Cache (integer)
Number_of_Chunks (integer)
Measurement: cm_REPO_GC_Statistics
Tags: host, node_id, process, tag
Fields: accumulated_deleted_garbage_repo (integer)
accumulated_reclaimed_garbage_repo (integer)
deleted_chunks_repo (integer)
deleted_data_repo (integer)
ec_freed_slots (integer)
full_reclaimable_aligned_chunk (integer)
merge_copy_overhead_in_deleted_data_repo (integer)
merge_copy_overhead_in_reclaimed_data_repo (integer)
reclaimed_chunk_repo (integer)
reclaimed_data_repo (integer)
slots_waiting_shipping (integer)
slots_waiting_verification (integer)
total_ec_free_slots (integer)
Measurement: cm_Rebalance_Statistics
Tags: host, node_id, process, tag
Fields: bytes_rebalanced (integer)
bytes_rebalancing_failed (integer)
chunks_canceled (integer)
chunks_for_rebalancing (integer)
chunks_rebalanced (integer)
chunks_total (integer)
jobs_canceled (integer)
segments_for_rebalancing (integer)
segments_rebalanced (integer)
segments_rebalancing_failed (integer)
segments_total (integer)
Measurement: cm_Rebalance_Statistics_CoS
Tags: CoS, host, node_id, process, tag
Fields: bytes_rebalanced (integer)
bytes_rebalancing_failed (integer)
chunks_canceled (integer)
chunks_for_rebalancing (integer)
chunks_rebalanced (integer)
chunks_total (integer)
jobs_canceled (integer)
segments_for_rebalancing (integer)
segments_rebalanced (integer)
segments_rebalancing_failed (integer)
segments_total (integer)
Measurement: cm_Recover_Statistics
Tags: host, node_id, process, tag
Fields: chunks_to_recover (integer)
data_recovered (integer)
data_to_recover (integer)
Measurement: cm_Recover_Statistics_CoS
Tags: CoS, host, node_id, process, tag
Fields: chunks_to_recover (integer)
data_recovered (integer)
data_to_recover (integer)
SR statistics
These statistics represent processes in the ECS SR service, which is responsible for space reclamation.
Measurement: sr_REPO_GC_Statistics
Tags: host, node_id, process, tag
Fields: accumulated_merge_copy_overhead_in_full_garbage (integer)
accumulated_total_repo_garbage (integer)
full_reclaimable_repo_chunk (integer)
garbage_in_partial_sr_tasks (integer)
garbage_in_repo_usage (integer)
merge_copy_overhead_in_full_garbage (integer)
merge_way_gc_processed_chunks (integer)
merge_way_gc_src_chunks (integer)
merge_way_gc_targeted_chunks (integer)
merge_way_gc_tasks (integer)
total_repo_garbage (integer)
usage_between_0%_and_33.3%_repo_chunk (integer)
usage_between_33.3%_and_50%_repo_chunk (integer)
usage_between_50%_and_66.7%_repo_chunk (integer)
SSM statistics
These statistics represent processes in the ECS storage manager (SSM) service.
Measurement: ssm_sstable_SSTable_SS
Tags: SS, SSTable, last, process, tag
Fields: allocatedSpace (integer)
availableFreeSpace (integer)
downDurationTotal (integer)
freeSpace (integer)
largeBlockAllocated (integer)
largeBlockAllocatedSize (integer)
largeBlockFreed (integer)
largeBlockFreedSize (integer)
pendingDurationTotal (integer)
pingerDurationTotal (integer)
smallBlockAllocated (integer)
smallBlockFreed (integer)
smallBlockFreedSize (integer)
smallBlockSize (integer)
state (string)
timeInStateTotal (integer)
totalSpace (integer)
upDurationTotal (integer)
Measurement: ssm_sstable_SSTable_SS_datamigration
Tags: SS, SSTable, last, process
Fields: status (integer)
totalCapacityToMigrate (integer)
Database monitoring_last
Service status, memory, and cache statistics
Measurement: blob_Process_status
Tags: Process_status, host, node_id, process
Fields: MemoryTableFreeSpacePercentagePerMinute (integer)
NumberofWritePageAllocationOutsideWriteCache (integer)
Measurement: blob_Total_memory_and_disk_cache_size
Tags: Total_memory_and_disk_cache_size, host, last, node_id, process
Fields: Disk_cache_size (integer)
Memory_cache_size (integer)
Measurement: cm_Process_status
Tags: Process_status, host, node_id, process
Fields: MemoryTableFreeSpacePercentagePerMinute (integer)
NumberofWritePageAllocationOutsideWriteCache (integer)
Measurement: eventsvc_Process_status
Tags: Process_status, host, node_id, process
Fields: MemoryTableFreeSpacePercentagePerMinute (integer)
NumberofWritePageAllocationOutsideWriteCache (integer)
Measurement: mm_Process_status
Tags: Process_status, host, node_id, process
Fields: MemoryTableFreeSpacePercentagePerMinute (integer)
NumberofWritePageAllocationOutsideWriteCache (integer)
Measurement: resource_Process_status
Tags: Process_status, host, node_id, process
Fields: MemoryTableFreeSpacePercentagePerMinute (integer)
NumberofWritePageAllocationOutsideWriteCache (integer)
Measurement: rm_Process_status
Tags: Process_status, host, node_id, process
Fields: MemoryTableFreeSpacePercentagePerMinute (integer)
NumberofWritePageAllocationOutsideWriteCache (integer)
Measurement: sr_Process_status
Tags: Process_status, host, node_id, process
Fields: MemoryTableFreeSpacePercentagePerMinute (integer)
NumberofWritePageAllocationOutsideWriteCache (integer)
Measurement: sr_Total_memory_and_disk_cache_size
Tags: Total_memory_and_disk_cache_size, host, last, node_id, process
Fields: Disk_cache_size (integer)
Memory_cache_size (integer)
Measurement: ssm_Process_status
Tags: Process_status, host, node_id, process
Fields: MemoryTableFreeSpacePercentagePerMinute (integer)
NumberofWritePageAllocationOutsideWriteCache (integer)
Measurement: dtquery_cmf
Tags: last, process
Fields: com.emc.ecs.chunk.gc.btree.enabled (integer)
com.emc.ecs.chunk.gc.btree.scanner.verification.enabled (integer)
com.emc.ecs.chunk.gc.repo.enabled (integer)
com.emc.ecs.chunk.gc.repo.verification.enabled (integer)
com.emc.ecs.chunk.rebalance.is_enabled (integer)
com.emc.ecs.objectgc.cas.enabled (integer)
com.emc.ecs.sensor.btree_sr_pending_mininum (integer)
com.emc.ecs.sensor.repo_sr_pending_mininum (integer)
Measurement: mm_topn_bucket_by_obj_count_place
Tags: last, place, process, tag
Fields: bucketName (string)
namespace (string)
value (integer)
Measurement: mm_topn_bucket_by_obj_size_place
Measurement: vnestStat_membership_ismember
Tags: host, ismember, last, node_id, process
Fields: is_leader (string)
Measurement: vnestStat_performance_latency_type
Tags: host, id, last, node_id, process, type
Fields: +Inf (integer)
0.0 (integer)
1.0 (integer)
7999999.99999999 (integer)
825912.9477680004 (integer)
85266.52466135359 (integer)
8802.840841123942 (integer)
9.686250859269972 (integer)
908.7975284781536 (integer)
93.82345570870827 (integer)
Measurement: vnestStat_performance_transactions_from_type
Tags: from, host, last, node_id, process, type
Fields: failed_request_counter (integer)
succeed_request_counter (integer)
Database monitoring_op
Node system level statistics
Information:
Measurements listed in this section come from default Telegraf plugins, and each
measurement name matches its plugin name. Refer to the documentation for each
Telegraf plugin (for example, the "cpu" input plugin) for more information.
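Telegraf writes these measurements to InfluxDB as line protocol, where the measurement name, tags, and fields are encoded into one text line per point. As an illustrative sketch (the `to_line_protocol` helper is hypothetical and omits the full escaping rules of the real protocol), a point from the `cpu` measurement below could be serialized like this:

```python
def to_line_protocol(measurement, tags, fields, ts_ns):
    """Build a simplified InfluxDB line-protocol point.

    tags and fields are dicts; string field values are quoted,
    numeric values are written as-is. Real line protocol also
    escapes commas, spaces, and quotes, which is omitted here.
    """
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_str = ",".join(
        f'{k}="{v}"' if isinstance(v, str) else f"{k}={v}"
        for k, v in sorted(fields.items())
    )
    return f"{measurement},{tag_str} {field_str} {ts_ns}"

point = to_line_protocol(
    "cpu",
    {"cpu": "cpu-total", "host": "n1"},
    {"usage_idle": 97.5},
    1566000000000000000,
)
# point: "cpu,cpu=cpu-total,host=n1 usage_idle=97.5 1566000000000000000"
```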
Measurement: cpu
Tags: cpu, host, node_id, tag
Fields: usage_guest (float)
usage_guest_nice (float)
usage_idle (float)
usage_iowait (float)
usage_irq (float)
usage_nice (float)
usage_softirq (float)
usage_steal (float)
usage_system (float)
usage_user (float)
Measurement: disk
Tags: device, fstype, host, mode, node_id, path, tag
Fields: free (integer)
inodes_free (integer)
inodes_total (integer)
inodes_used (integer)
total (integer)
used (integer)
used_percent (float)
Measurement: diskio
Tags: ID_PART_ENTRY_UUID, SCSI_IDENT_SERIAL, SCSI_MODEL, SCSI_REVISION,
SCSI_VENDOR, host, name, node_id, tag
Fields: io_time (integer)
iops_in_progress (integer)
read_bytes (integer)
read_time (integer)
reads (integer)
weighted_io_time (integer)
write_bytes (integer)
write_time (integer)
writes (integer)
Measurement: linux_sysctl_fs
Tags: host, node_id, tag
Fields: aio-max-nr (integer)
aio-nr (integer)
dentry-age-limit (integer)
dentry-nr (integer)
dentry-unused-nr (integer)
dentry-want-pages (integer)
file-max (integer)
file-nr (integer)
inode-free-nr (integer)
inode-nr (integer)
inode-preshrink-nr (integer)
Measurement: mem
Tags: host, node_id, tag
Fields: active (integer)
available (integer)
available_percent (float)
buffered (integer)
cached (integer)
commit_limit (integer)
committed_as (integer)
dirty (integer)
free (integer)
high_free (integer)
high_total (integer)
huge_page_size (integer)
huge_pages_free (integer)
huge_pages_total (integer)
inactive (integer)
low_free (integer)
low_total (integer)
mapped (integer)
page_tables (integer)
shared (integer)
slab (integer)
swap_cached (integer)
swap_free (integer)
swap_total (integer)
total (integer)
used (integer)
used_percent (float)
vmalloc_chunk (integer)
vmalloc_total (integer)
vmalloc_used (integer)
wired (integer)
write_back (integer)
write_back_tmp (integer)
Measurement: net
Tags: host, interface, node_id, tag
Fields: bytes_recv (integer)
bytes_sent (integer)
bytes_sum (integer)
drop_in (integer)
drop_out (integer)
err_in (integer)
err_out (integer)
packets_recv (integer)
packets_sent (integer)
packets_sum (integer)
speed (integer)
utilization (integer)
Measurement: nstat
Tags: host, name, node_id, tag
Fields: IpExtInOctets (integer)
IpExtOutOctets (integer)
TcpInErrs (integer)
UdpInErrors (integer)
Measurement: processes
Tags: host, node_id, tag
Fields: blocked (integer)
dead (integer)
idle (integer)
paging (integer)
running (integer)
sleeping (integer)
stopped (integer)
total (integer)
total_threads (integer)
unknown (integer)
zombies (integer)
Measurement: procstat
Tags: host, node_id, process_name, tag, user
Fields: cpu_time (integer)
cpu_time_guest (float)
cpu_time_guest_nice (float)
cpu_time_idle (float)
cpu_time_iowait (float)
cpu_time_irq (float)
cpu_time_nice (float)
cpu_time_soft_irq (float)
cpu_time_steal (float)
cpu_time_stolen (float)
cpu_time_system (float)
cpu_time_user (float)
cpu_usage (float)
create_time (integer)
involuntary_context_switches (integer)
memory_data (integer)
memory_locked (integer)
memory_rss (integer)
memory_stack (integer)
memory_swap (integer)
memory_vms (integer)
nice_priority (integer)
num_fds (integer)
num_threads (integer)
pid (integer)
read_bytes (integer)
read_count (integer)
realtime_priority (integer)
rlimit_cpu_time_hard (integer)
rlimit_cpu_time_soft (integer)
rlimit_file_locks_hard (integer)
rlimit_file_locks_soft (integer)
rlimit_memory_data_hard (integer)
rlimit_memory_data_soft (integer)
rlimit_memory_locked_hard (integer)
rlimit_memory_locked_soft (integer)
rlimit_memory_rss_hard (integer)
rlimit_memory_rss_soft (integer)
rlimit_memory_stack_hard (integer)
rlimit_memory_stack_soft (integer)
rlimit_memory_vms_hard (integer)
rlimit_memory_vms_soft (integer)
rlimit_nice_priority_hard (integer)
rlimit_nice_priority_soft (integer)
rlimit_num_fds_hard (integer)
rlimit_num_fds_soft (integer)
rlimit_realtime_priority_hard (integer)
rlimit_realtime_priority_soft (integer)
rlimit_signals_pending_hard (integer)
rlimit_signals_pending_soft (integer)
signals_pending (integer)
voluntary_context_switches (integer)
write_bytes (integer)
write_count (integer)
Measurement: swap
Tags: host, node_id, tag
Fields: free (integer)
in (integer)
out (integer)
total (integer)
used (integer)
used_percent (float)
Measurement: system
Tags: host, node_id, tag
Fields: load1 (float)
load15 (float)
load5 (float)
n_cpus (integer)
n_users (integer)
uptime (integer)
uptime_format (string)
DT statistics
Measurement: dtquery_dt_dist_dt_node_id_type
Tags: dt_node_id, process, tag, type
Fields: count_i (integer)
Measurement: dtquery_dt_dist_host_dt_node_id
Tags: dt_node_id, process, tag
Fields: count_i (integer)
Measurement: dtquery_dt_dist_type_type
Tags: process, tag, type
Fields: count_i (integer)
Measurement: dtquery_dt_status
Tags: process, tag
Fields: total (integer)
unknown (integer)
unready (integer)
Measurement: dtquery_dt_status_detailed_type
Tags: process, tag, type
Fields: total (integer)
unknown (integer)
unready (integer)
Measurement: ecs_fabric_agent_dirstat_size_bytes
Tags: host, node_id, path, tag, url
Fields: gauge (float)
SR journal statistics
Measurement: sr_JournalParser_GC_RG_DT
Tags: DT, RG, last, process
Fields: majorMinorOfJournalRegion (string)
pendingChunks (integer)
timestampOfChunkRegion (string)
timestampOfJournalParserLastRun (string)
Measurement: sr_ObjectGC_CAS_RG
Tags: RG, last, process
Fields: STATUS (string)
Measurement: vnestStat_btree
Tags: cumulative_stats, host, level, node_id, tag
Fields: level_count (float)
page_count (float)
size_bytes (float)
Database monitoring_vdc
Metrics in this database are calculated over the whole VDC, without reference to a particular data
node.
Information:
The metrics below aggregate the raw measurements across data nodes and are used in the
Grafana ECS UI.
Measurement: cq_disk_bandwidth
Tags: type_op ('read', 'write')
Fields: consistency_checker (float)
erasure_encoding (float)
geo (float)
hardware_recovery (float)
total (float)
user_traffic (float)
xor (float)
Measurement: cq_node_rebalancing_summary
Tags: none
Fields: data_rebalanced (integer)
pending_rebalance (integer)
Measurement: cq_process_health
Tags: none
Fields: cpu_used (float)
mem_used (float)
mem_used_percent (float)
nic_bytes (float)
nic_utilization (float)
Measurement: cq_recover_status_summary
Tags: none
Fields: data_recovered (integer)
data_to_recover (integer)
Measurement: statDataHead_performance_internal_error
Tags: host, node_id, process, tag
Fields: system_errors (integer)
user_errors (integer)
Measurement: statDataHead_performance_internal_error_code
Tags: code, host, node_id, process, tag
Fields: error_counter (integer)
Measurement: statDataHead_performance_internal_error_head
Tags: head, host, node_id, process, tag
Fields: system_errors (integer)
user_errors (integer)
Measurement: statDataHead_performance_internal_error_head_namespace
Tags: head, host, namespace, node_id, process, tag
Fields: system_errors (integer)
user_errors (integer)
Measurement: statDataHead_performance_internal_latency
Tags: host, id, node_id, process, tag
Fields: +Inf (integer)
0.0 (integer)
1.0 (integer)
111.6295328521717 (integer)
12461.15260479408 (integer)
23.183877401213103 (integer)
2588.0054039994393 (integer)
4.814963904455889 (integer)
537.4921713544796 (integer)
59999.999999999985 (integer)
Measurement: statDataHead_performance_internal_latency_head
Tags: head, host, id, node_id, process, tag
Fields: +Inf (integer)
0.0 (integer)
1.0 (integer)
111.6295328521717 (integer)
12461.15260479408 (integer)
23.183877401213103 (integer)
2588.0054039994393 (integer)
4.814963904455889 (integer)
537.4921713544796 (integer)
59999.999999999985 (integer)
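The numeric field names in the latency measurements above (0.0, 1.0, ..., +Inf) appear to be cumulative histogram bucket upper bounds, with the +Inf bucket holding the total observation count. Assuming that interpretation, a percentile such as the p50/p99 values reported by the cq_performance_latency measurements can be estimated by finding the first bucket whose cumulative count covers the requested quantile. A hypothetical sketch:

```python
import math

def percentile_from_buckets(buckets, q):
    """Estimate the q-th percentile (0 < q <= 1) from cumulative
    histogram buckets given as {upper_bound: cumulative_count},
    where the math.inf bucket holds the total count."""
    bounds = sorted(buckets)           # ascending; math.inf sorts last
    total = buckets[bounds[-1]]        # count in the +Inf bucket
    threshold = q * total
    for bound in bounds:
        if buckets[bound] >= threshold:
            return bound               # first bucket covering quantile q
    return math.inf

buckets = {0.0: 0, 1.0: 10, 10.0: 50, 100.0: 90, math.inf: 100}
p50 = percentile_from_buckets(buckets, 0.5)   # 10.0
```

This returns the bucket's upper bound rather than interpolating within the bucket, which is a common simplification for histogram-based percentile estimates.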
Measurement: statDataHead_performance_internal_throughput
Tags: host, node_id, process, tag
Fields: total_read_requests_size (integer)
total_write_requests_size (integer)
Measurement: statDataHead_performance_internal_throughput_head
Tags: head, host, node_id, process, tag
Fields: total_read_requests_size (integer)
total_write_requests_size (integer)
Measurement: statDataHead_performance_internal_transactions
Tags: host, node_id, process, tag
Fields: failed_request_counter (integer)
succeed_request_counter (integer)
Measurement: statDataHead_performance_internal_transactions_head
Tags: head, host, node_id, process, tag
Fields: failed_request_counter (integer)
succeed_request_counter (integer)
Measurement: statDataHead_performance_internal_transactions_head_namespace
Tags: head, host, namespace, node_id, process, tag
Fields: failed_request_counter (integer)
succeed_request_counter (integer)
Measurement: statDataHead_performance_internal_transactions_method
Tags: host, method, node_id, process, tag
Fields: failed_request_counter (integer)
succeed_request_counter (integer)
Database monitoring_vdc
Performance metrics in this database are calculated over the whole VDC, without reference to a
particular data node.
Most values are one of the following:
l Rates (number of requests per second): all measurements not ending in "_delta"
l Delta values (the increase of a counter since the previous timestamp): measurements ending in
"_delta"
l Downsampled values (aggregated to one point per day): measurements ending in
"_downsampled"
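The delta and rate values described above can be derived from raw counter samples: a delta is the increase since the previous sample, and a rate divides that increase by the elapsed time. A minimal sketch (the `deltas_and_rates` helper is hypothetical, not part of ECS):

```python
def deltas_and_rates(samples):
    """samples: list of (timestamp_seconds, counter_value) pairs.

    Returns (deltas, rates): each delta is the counter increase
    since the previous sample, and each rate is that increase
    divided by the elapsed seconds, mirroring the "_delta" and
    rate measurements described above.
    """
    deltas, rates = [], []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        d = v1 - v0
        deltas.append(d)
        rates.append(d / (t1 - t0))
    return deltas, rates

# Three counter samples 10 seconds apart:
deltas, rates = deltas_and_rates([(0, 0), (10, 50), (20, 150)])
# deltas == [50, 100], rates == [5.0, 10.0]
```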
Measurement: cq_performance_error
Tags: none
Fields: system_errors (float)
user_errors (float)
Measurement: cq_performance_error_downsampled
Tags: none
Fields: system_errors (float)
user_errors (float)
Measurement: cq_performance_error_code
Tags: code
Fields: error_counter (float)
Measurement: cq_performance_error_code_downsampled
Tags: code
Fields: error_counter (float)
Measurement: cq_performance_error_delta
Tags: none
Fields: system_errors_i (integer)
user_errors_i (integer)
Measurement: cq_performance_error_delta_downsampled
Tags: none
Fields: system_errors_i (integer)
user_errors_i (integer)
Measurement: cq_performance_error_head
Tags: head
Fields: system_errors (float)
user_errors (float)
Measurement: cq_performance_error_head_downsampled
Tags: head
Fields: system_errors (float)
user_errors (float)
Measurement: cq_performance_error_head_delta
Tags: head
Fields: system_errors_i (integer)
user_errors_i (integer)
Measurement: cq_performance_error_head_delta_downsampled
Tags: head
Fields: system_errors_i (integer)
user_errors_i (integer)
Measurement: cq_performance_error_ns
Tags: namespace
Fields: system_errors (float)
user_errors (float)
Measurement: cq_performance_error_ns_downsampled
Tags: namespace
Fields: system_errors (float)
user_errors (float)
Measurement: cq_performance_error_ns_delta
Tags: namespace
Fields: system_errors_i (integer)
user_errors_i (integer)
Measurement: cq_performance_error_ns_delta_downsampled
Tags: namespace
Fields: system_errors_i (integer)
user_errors_i (integer)
Measurement: cq_performance_latency
Tags: id
Fields: p50 (float)
p99 (float)
Measurement: cq_performance_latency_downsampled
Tags: id
Fields: p50 (float)
p99 (float)
Measurement: cq_performance_latency_head
Tags: head, id
Fields: p50 (float)
p99 (float)
Measurement: cq_performance_latency_head_downsampled
Tags: head, id
Fields: p50 (float)
p99 (float)
Measurement: cq_performance_throughput
Tags: none
Fields: total_read_requests_size (float)
total_write_requests_size (float)
Measurement: cq_performance_throughput_downsampled
Tags: none
Fields: total_read_requests_size (float)
total_write_requests_size (float)
Measurement: cq_performance_throughput_head
Tags: head
Fields: total_read_requests_size (float)
total_write_requests_size (float)
Measurement: cq_performance_throughput_head_downsampled
Tags: head
Fields: total_read_requests_size (float)
total_write_requests_size (float)
Measurement: cq_performance_transaction
Tags: none
Fields: failed_request_counter (float)
succeed_request_counter (float)
Measurement: cq_performance_transaction_downsampled
Tags: none
Fields: failed_request_counter (float)
succeed_request_counter (float)
Measurement: cq_performance_transaction_delta
Tags: none
Fields: failed_request_counter_i (integer)
succeed_request_counter_i (integer)
Measurement: cq_performance_transaction_delta_downsampled
Tags: none
Fields: failed_request_counter_i (integer)
succeed_request_counter_i (integer)
Measurement: cq_performance_transaction_head
Tags: head
Fields: failed_request_counter (float)
succeed_request_counter (float)
Measurement: cq_performance_transaction_head_downsampled
Tags: head
Fields: failed_request_counter (float)
succeed_request_counter (float)
Measurement: cq_performance_transaction_head_delta
Tags: head
Fields: failed_request_counter_i (integer)
succeed_request_counter_i (integer)
Measurement: cq_performance_transaction_head_delta_downsampled
Tags: head
Fields: failed_request_counter_i (integer)
succeed_request_counter_i (integer)
Measurement: cq_performance_transaction_method
Tags: method
Fields: failed_request_counter (float)
succeed_request_counter (float)
Measurement: cq_performance_transaction_method_downsampled
Tags: method
Fields: failed_request_counter (float)
succeed_request_counter (float)
Measurement: cq_performance_transaction_ns
Tags: namespace
Measurement: cq_performance_transaction_ns_downsampled
Tags: namespace
Fields: failed_request_counter (float)
succeed_request_counter (float)
Measurement: cq_performance_transaction_ns_delta
Tags: namespace
Fields: failed_request_counter_i (integer)
succeed_request_counter_i (integer)
Measurement: cq_performance_transaction_ns_delta_downsampled
Tags: namespace
Fields: failed_request_counter_i (integer)
succeed_request_counter_i (integer)
GET /dashboard/nodes/{id}/processes
GET /dashboard/processes/{id}
Flux API
Database:
l monitoring_op
Measurement:
l procstat (detailed info on available fields and tags: https://github.com/influxdata/telegraf/tree/master/plugins/inputs/procstat)
Fields:
l memory_rss - resident memory of a process (bytes)
l cpu_usage - CPU usage for a process (percent of a single CPU)
l num_threads - number of threads used by the process (integer)
Tags:
l process_name- valid process names:
n blobsvc
n cm
n coordinatorsvc
n dataheadsvc
n dtquery
n ecsportalsvc
n eventsvc
n georeceiver
n metering
n objcontrolsvc
n resourcesvc
n transformsvc
n vnest
n fluxd
n influxd
n throttler
n grafana-server
n dockerd
n fabric-agent
n fabric-lifecycle
n fabric-registry
n fabric-zookeeper
l host - hostname (FQDN)
l node_id - host ID
Note: Queries can also filter by node ID or process name, for example:
r.node_id == "330e4b8f-4491-4ec7-b816-7b10ac9c6abf"
r.process_name == "cm"
Example query:
from(bucket: "monitoring_op")
|> filter(fn: (r) => r._measurement == "procstat" and r._field ==
"memory_rss" and r.process_name == "vnest" and r.host == "host_name")
|> range(start: -24h)
|> keep(columns: ["_time", "_value", "process_name"])
Example output:
#datatype,string,long,dateTime:RFC3339,long,string
#group,false,false,false,false,true
#default,_result,,,,
,result,table,_time,_value,process_name
,,0,2019-08-15T13:05:00Z,2505809920,vnest
,,0,2019-08-15T13:10:00Z,2505887744,vnest
,,0,2019-08-15T13:15:00Z,2506014720,vnest
,,0,2019-08-15T13:20:01Z,2506010624,vnest
Nodes statistics
Dashboard API
GET /dashboard/nodes/{id}
Database:
l monitoring_op
Measurement:
l cpu (detailed info on available fields and tags: https://github.com/influxdata/telegraf/tree/master/plugins/inputs/cpu)
Fields:
l usage_idle - idle CPU usage (percent)
Tags:
l host - hostname (FQDN)
l node_id - host ID
Example query:
from(bucket: "monitoring_op")
|> filter(fn: (r) => r._measurement == "cpu" and r.cpu == "cpu-total" and
r._field == "usage_idle" and r.host == "host_name")
|> range(start: -24h)
|> keep(columns: ["_time", "_value", "host"])
Example output:
#datatype,string,long,dateTime:RFC3339,double,string
#group,false,false,false,false,true
#default,_result,,,,
,result,table,_time,_value,host
,,0,2019-08-15T13:20:00Z,19.549454477395525,host_name
,,0,2019-08-15T13:25:00Z,17.920104933062728,host_name
,,0,2019-08-15T13:30:00Z,18.050788903551002,host_name
,,0,2019-08-15T13:35:00Z,19.801364027505095,host_name
Measurement:
l mem (detailed info on available fields and tags: https://github.com/influxdata/telegraf/tree/master/plugins/inputs/mem)
Fields:
l free - free memory on the host (bytes)
Tags:
l host - hostname (FQDN)
l node_id - host ID
Example query:
from(bucket: "monitoring_op")
|> filter(fn: (r) => r._measurement == "mem" and r._field == "free" and
r.host == "host_name")
|> range(start: -24h)
|> keep(columns: ["_time", "_value", "host"])
Example output:
#datatype,string,long,dateTime:RFC3339,long,string
#group,false,false,false,false,true
#default,_result,,,,
,result,table,_time,_value,host
,,0,2019-08-15T14:10:00Z,3181088768,host_name
,,0,2019-08-15T14:15:00Z,2988388352,host_name
,,0,2019-08-15T14:20:00Z,3002994688,host_name
,,0,2019-08-15T14:25:00Z,3115741184,host_name
Performance statistics
Dashboard API
GET /dashboard/nodes/{id}
GET /dashboard/zones/localzone
GET /dashboard/zones/localzone/nodes
Dashboard APIs
Lists the APIs that are deprecated.
APIs removed in ECS 3.5.0
The following table lists the APIs that are removed in ECS 3.5.0: