Core storage metrics and alerts

Metrics used to monitor core storage are defined in Prometheus recording rules, which are stored in the following files on any node in the cluster:

  • /var/lib/prometheus/rules/mdsd.rules
  • /var/lib/prometheus/rules/csd.rules
  • /var/lib/prometheus/rules/fused.rules
  • /var/lib/prometheus/rules/rjournal.rules

Metrics that are used to generate core storage alerts are added to the alerting rules in /var/lib/prometheus/alerts/pcs.rules. These metrics are described in the following table:

Metric | Description
fused_stuck_reqs_30s | Number of I/O requests on a node that have been stuck for more than 30 seconds
fused_stuck_reqs_10s | Number of I/O requests on a node that have been stuck for more than 10 seconds
fused_maps_failed | Number of failed map requests on a node
fused_map_failures_total | Total number of failed map requests on a node
fused_unaligned_writes:rate5m | Number of unaligned write requests per second, averaged over 5 minutes
fused_writes:rate5m | Number of write requests per second, averaged over 5 minutes
fused_unaligned_reads:rate5m | Number of unaligned read requests per second, averaged over 5 minutes
fused_reads:rate5m | Number of read requests per second, averaged over 5 minutes
mdsd_cluster_replication_stuck_chunks | Number of chunks that block replication
mdsd_cluster_replication_touts_total | Total number of chunks that slow down replication
job:mdsd_fs_chunk_maps:sum | Number of chunks in the storage cluster
job:mdsd_fs_files:sum | Number of files in the storage cluster
master:mdsd_cs_status | Chunk service status
mdsd_cluster_free_space_bytes | Amount of free physical space in the storage cluster, in bytes
mdsd_cluster_space_bytes | Total amount of physical space in the storage cluster, in bytes
mdsd_is_master | Node that runs the master metadata service
mdsd_master_uptime | Uptime of the master metadata service
instance_le:rjournal_commit_duration_seconds_bucket:rate5m | Commit latency of a particular metadata service over 5 minutes, for each histogram bucket
instance_csid:csd_journal_usage_ratio:rate5m | Percentage of free space in a chunk service journal over 5 minutes
process_cpu_seconds_total | Total CPU time used by a process, in seconds
process_swap_bytes | Amount of swap space used by a process, in bytes
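
As a rough illustration of how these metrics can be read, the sketch below builds an instant-query URL for the Prometheus HTTP API (/api/v1/query) and derives the percentage of free physical space from mdsd_cluster_free_space_bytes and mdsd_cluster_space_bytes. The Prometheus address http://localhost:9090, the helper names, and any sample values are assumptions made for the example, not values taken from the product.

```python
import json
from urllib.parse import urlencode

# Assumption: Prometheus listens on this address; adjust it for your cluster.
PROMETHEUS_URL = "http://localhost:9090"


def instant_query_url(expr, base_url=PROMETHEUS_URL):
    """Build a URL for the Prometheus instant-query endpoint."""
    return f"{base_url}/api/v1/query?{urlencode({'query': expr})}"


def first_value(response_text):
    """Return the first sample value from an instant-query JSON response."""
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        raise ValueError("query failed")
    result = payload["data"]["result"]
    if not result:
        raise ValueError("query returned no samples")
    # Each vector sample looks like {"metric": {...}, "value": [timestamp, "string"]}.
    return float(result[0]["value"][1])


def free_space_percent(free_bytes, total_bytes):
    """Percentage of free physical space, computed from the two cluster metrics."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * free_bytes / total_bytes
```

To fetch live data, the URL returned by instant_query_url() can be passed to urllib.request.urlopen() and the decoded response body handed to first_value().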

Based on the metrics above, the following core storage alerts are generated in Prometheus:

Title | Message | Severity
Cluster has more than 5 metadata services | There are too many metadata services in the cluster. Each extra metadata service slows down operations with metadata. | warning
Node has stuck I/O requests | Some I/O requests are stuck on <node>. | critical
Cluster has blocked or slow replication | Chunk replication is blocked or too slow. | critical
Swap space is used | Swap space is used by <service_name> on <node>. | warning
Node has failed map requests | Some map requests on <node> have failed. | critical
Cluster has too many chunks | There are too many chunks in the cluster, which slows down the metadata service. | warning
Cluster has too many chunks | There are too many chunks in the cluster, which slows down the metadata service. | critical
Cluster has too many files | There are too many files in the cluster, which slows down the metadata service. | warning
Cluster has too many files | There are too many files in the cluster, which slows down the metadata service. | critical
Metadata service has high CPU usage | Metadata service on <node> has CPU usage higher than 80%. The service may be overloaded. | warning
Metadata service has high commit latency | Metadata service on <node> has the 95th percentile latency higher than 1 second. | warning
Metadata service has high commit latency | Metadata service on <node> has the 95th percentile latency higher than 5 seconds. | critical
Cluster has failed mount points | Some mount points stopped working and need to be recovered. | critical
Cluster has slow chunk services | Some chunk services experience slowdown and degrade the cluster performance. | warning
Cluster has offline chunk services | Some chunk services are offline. Check and restart them. | warning
Cluster has failed chunk services | Some chunk services have failed. It may be caused by physical drive failure. | warning
Cluster has unavailable metadata services | Some metadata services are offline or have failed. Check and restart them. | warning
Cluster has unaligned I/O writes | I/O writes are not aligned. It may be caused by a wrongly formatted disk in a virtual machine. | information
Cluster has unaligned I/O reads | I/O reads are not aligned. It may be caused by a wrongly formatted disk in a virtual machine. | information
Cluster is running out of physical space | There is little free physical space left on each storage tier. | warning
Cluster is out of physical space | There is not enough free physical space on each storage tier. | critical
Master metadata service changes too often | Master metadata service has changed more than once in 5 minutes. | warning
CS journal is running out of space | CS journal has less than 20% of free space left. | warning
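
The commit latency alert compares a 95th percentile against the 1-second and 5-second thresholds listed above. The sketch below shows one way such a percentile could be estimated from cumulative histogram buckets, using the same linear interpolation that the PromQL function histogram_quantile() applies, and then mapped to a severity. The function names and the bucket values in the usage note are hypothetical; the actual alert expressions are defined in /var/lib/prometheus/alerts/pcs.rules and may differ.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: (upper_bound_seconds, cumulative_rate) pairs; the last upper
    bound is expected to be math.inf, as in Prometheus histograms.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total <= 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the +Inf bucket: report the
                # highest finite upper bound instead.
                return prev_bound
            # Linear interpolation inside the matching bucket.
            share = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * share
        prev_bound, prev_count = bound, count
    return prev_bound


def commit_latency_severity(p95_seconds):
    """Map a 95th percentile latency to the severity from the alerts table."""
    if p95_seconds > 5:
        return "critical"
    if p95_seconds > 1:
        return "warning"
    return None
```

For example, with hypothetical buckets [(0.5, 80), (1.0, 90), (5.0, 99), (math.inf, 100)], the estimated 95th percentile lies between 1 and 5 seconds, so commit_latency_severity() returns "warning".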