Core storage metrics and alerts
The metrics used to monitor core storage are defined in Prometheus recording rules, which are stored in the following files on any node in the cluster:
- /var/lib/prometheus/rules/mdsd.rules
- /var/lib/prometheus/rules/csd.rules
- /var/lib/prometheus/rules/fused.rules
- /var/lib/prometheus/rules/rjournal.rules
The metrics used to generate core storage alerts are added to the alerting rules in /var/lib/prometheus/alerts/pcs.rules. These metrics are described in the following table:
Metric | Description |
---|---|
fused_stuck_reqs_30s | Number of I/O requests on a node that have been stuck for more than 30 seconds |
fused_stuck_reqs_10s | Number of I/O requests on a node that have been stuck for more than 10 seconds |
fused_maps_failed | Number of failed map requests on a node |
fused_map_failures_total | Total number of failed map requests on a node |
fused_unaligned_writes:rate5m | Number of unaligned write requests per second, averaged over 5 minutes |
fused_writes:rate5m | Number of write requests per second, averaged over 5 minutes |
fused_unaligned_reads:rate5m | Number of unaligned read requests per second, averaged over 5 minutes |
fused_reads:rate5m | Number of read requests per second, averaged over 5 minutes |
mdsd_cluster_replication_stuck_chunks | Number of chunks that block replication |
mdsd_cluster_replication_touts_total | Total number of chunks that slow down replication |
job:mdsd_fs_chunk_maps:sum | Number of chunks in the storage cluster |
job:mdsd_fs_files:sum | Number of files in the storage cluster |
master:mdsd_cs_status | Chunk service status |
mdsd_cluster_free_space_bytes | Amount of free physical space in the storage cluster, in bytes |
mdsd_cluster_space_bytes | Total amount of physical space in the storage cluster, in bytes |
mdsd_is_master | Node that runs the master metadata service |
mdsd_master_uptime | Uptime of the master metadata service |
instance_le:rjournal_commit_duration_seconds_bucket:rate5m | Commit latency of a particular metadata service over 5 minutes, for each bucket |
instance_csid:csd_journal_usage_ratio:rate5m | Percentage of free space in a chunk service journal over 5 minutes |
process_cpu_seconds_total | Total amount of CPU time used by a process, in seconds |
process_swap_bytes | Amount of swap space used by a process, in bytes |
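
To check the current value of any metric from the table, you can query the Prometheus HTTP API. The following is a minimal sketch, not a supported tool. It assumes that Prometheus is reachable at http://localhost:9090 (its default port; the address in your cluster may differ), and the helper names build_query_url, parse_instant_vector, and query_metric are hypothetical. Only the /api/v1/query endpoint and its response format come from Prometheus itself.

```python
import json
import urllib.parse
import urllib.request

# Assumption: Prometheus listens on its default port on the local node.
PROMETHEUS_URL = "http://localhost:9090"


def build_query_url(base_url, expr):
    """Build an instant-query URL for the Prometheus HTTP API."""
    query = urllib.parse.urlencode({"query": expr})
    return base_url.rstrip("/") + "/api/v1/query?" + query


def parse_instant_vector(payload):
    """Turn an instant-vector response into a list of (labels, value) pairs."""
    if payload.get("status") != "success":
        raise RuntimeError("query failed: %s" % payload.get("error", "unknown error"))
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError("expected an instant vector, got %s" % data["resultType"])
    # Each sample looks like {"metric": {labels}, "value": [timestamp, "value"]}.
    return [(sample["metric"], float(sample["value"][1])) for sample in data["result"]]


def query_metric(expr, base_url=PROMETHEUS_URL, timeout=10):
    """Run an instant query against Prometheus and return the parsed samples."""
    url = build_query_url(base_url, expr)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_instant_vector(json.load(response))
```

For example, calling query_metric("mdsd_cluster_free_space_bytes") on a node where Prometheus is reachable would return the label set and current value of each matching time series.
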
Based on the metrics above, the following core storage alerts are generated in Prometheus:
Title | Message | Severity |
---|---|---|
Cluster has more than 5 metadata services | There are too many metadata services in the cluster. Each extra metadata service slows down operations with metadata. | warning |
Node has stuck I/O requests | Some I/O requests are stuck on <node>. | critical |
Cluster has blocked or slow replication | Chunk replication is blocked or too slow. | critical |
Swap space is used | Swap space is used by <service_name> on <node>. | warning |
Node has failed map requests | Some map requests on <node> have failed. | critical |
Cluster has too many chunks | There are too many chunks in the cluster, which slows down the metadata service. | warning |
Cluster has too many chunks | There are too many chunks in the cluster, which slows down the metadata service. | critical |
Cluster has too many files | There are too many files in the cluster, which slows down the metadata service. | warning |
Cluster has too many files | There are too many files in the cluster, which slows down the metadata service. | critical |
Metadata service has high CPU usage | Metadata service on <node> has CPU usage higher than 80%. The service may be overloaded. | warning |
Metadata service has high commit latency | Metadata service on <node> has the 95th percentile latency higher than 1 second. | warning |
Metadata service has high commit latency | Metadata service on <node> has the 95th percentile latency higher than 5 seconds. | critical |
Cluster has failed mount points | Some mount points stopped working and need to be recovered. | critical |
Cluster has slow chunk services | Some chunk services experience slowdown and degrade the cluster performance. | warning |
Cluster has offline chunk services | Some chunk services are offline. Check and restart them. | warning |
Cluster has failed chunk services | Some chunk services have failed. It may be caused by physical drive failure. | warning |
Cluster has unavailable metadata services | Some metadata services are offline or have failed. Check and restart them. | warning |
Cluster has unaligned I/O writes | I/O writes are not aligned. It may be caused by an incorrectly formatted disk in a virtual machine. | information |
Cluster has unaligned I/O reads | I/O reads are not aligned. It may be caused by an incorrectly formatted disk in a virtual machine. | information |
Cluster is running out of physical space | There is little free physical space left on each storage tier. | warning |
Cluster is out of physical space | There is not enough free physical space on each storage tier. | critical |
Master metadata service changes too often | Master metadata service has changed more than once in 5 minutes. | warning |
CS journal is running out of space | CS journal has less than 20% of free space left. | warning |
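
The commit latency alerts above compare a 95th percentile with thresholds of 1 second and 5 seconds. The sketch below illustrates how such a percentile can be estimated from cumulative histogram buckets, using the same linear interpolation idea as the Prometheus histogram_quantile function. It is an illustration only: the actual alert expression in pcs.rules is not reproduced here, and the function names estimate_quantile and commit_latency_severity, as well as the sample bucket values, are hypothetical.

```python
import math

# Thresholds taken from the alert table above.
WARNING_SECONDS = 1.0
CRITICAL_SECONDS = 5.0


def estimate_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs and must
    include the +Inf bucket, whose count is the total number of observations.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in the range (0, 1]")
    buckets = sorted(buckets)
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("buckets must include the +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile lies in the open-ended bucket, so report the
                # highest finite upper bound instead.
                return buckets[-2][0] if len(buckets) > 1 else math.nan
            # Interpolate linearly inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


def commit_latency_severity(buckets):
    """Map the estimated 95th percentile latency to an alert severity."""
    p95 = estimate_quantile(0.95, buckets)
    if p95 > CRITICAL_SECONDS:
        return "critical"
    if p95 > WARNING_SECONDS:
        return "warning"
    return None


# Hypothetical per-bucket values for one metadata service instance.
example = [(0.5, 50), (1.0, 80), (2.0, 100), (math.inf, 100)]
print(commit_latency_severity(example))
```

With the sample buckets, the 95th observation falls into the bucket between 1 and 2 seconds, so the estimate is above the warning threshold but below the critical one.
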