Core storage metrics and alerts

The metrics used to monitor core storage are defined as Prometheus recording rules in the following files, which are available on any node in the cluster:

/var/lib/prometheus/rules/mdsd.rules
/var/lib/prometheus/rules/csd.rules
/var/lib/prometheus/rules/fused.rules
/var/lib/prometheus/rules/rjournal.rules
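
To see which recorded metrics a rules file defines, the `record:` entries can be extracted from it. The following is a minimal sketch, assuming that the files use the standard Prometheus rule file layout (YAML with `record:` and `expr:` keys). The helper names `recorded_metric_names` and `list_recorded_metrics` are hypothetical and are not part of the product.

```python
import re
from pathlib import Path
from typing import Dict, Iterable, List

# Rule files listed above.
RULE_FILES = [
    "/var/lib/prometheus/rules/mdsd.rules",
    "/var/lib/prometheus/rules/csd.rules",
    "/var/lib/prometheus/rules/fused.rules",
    "/var/lib/prometheus/rules/rjournal.rules",
]

# Assumption: rules follow the standard Prometheus YAML layout, for example
#   - record: job:mdsd_fs_files:sum
# The name may optionally be quoted.
RECORD_RE = re.compile(
    r"^[ \t]*(?:-[ \t]*)?record:[ \t]*['\"]?([A-Za-z_:][A-Za-z0-9_:]*)['\"]?[ \t]*$",
    re.MULTILINE,
)


def recorded_metric_names(rules_text: str) -> List[str]:
    """Return the metric names declared with 'record:' in a rules file."""
    return RECORD_RE.findall(rules_text)


def list_recorded_metrics(paths: Iterable[str] = RULE_FILES) -> Dict[str, List[str]]:
    """Map each existing rules file to the recorded metric names it declares."""
    result = {}
    for name in paths:
        path = Path(name)
        if path.is_file():  # skip files that are missing on this node
            result[name] = recorded_metric_names(path.read_text())
    return result


if __name__ == "__main__":
    for rules_file, names in list_recorded_metrics().items():
        print(rules_file)
        for metric in names:
            print("  " + metric)
```

A regular expression is used instead of a YAML parser only to keep the sketch free of third-party dependencies. It does not handle every valid YAML form.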

The metrics that are used to generate core storage alerts are referenced by the alerting rules in /var/lib/prometheus/alerts/pcs.rules. These metrics are described in the following table:

Metric	Description
`fused_stuck_reqs_30s`	Number of I/O requests on a node that have been stuck for more than 30 seconds
`fused_stuck_reqs_10s`	Number of I/O requests on a node that have been stuck for more than 10 seconds
`fused_maps_failed`	Number of failed map requests on a node
`fused_map_failures_total`	Total number of failed map requests on a node
`fused_unaligned_writes:rate5m`	Per-second rate of unaligned write requests, averaged over the last 5 minutes
`fused_writes:rate5m`	Per-second rate of write requests, averaged over the last 5 minutes
`fused_unaligned_reads:rate5m`	Per-second rate of unaligned read requests, averaged over the last 5 minutes
`fused_reads:rate5m`	Per-second rate of read requests, averaged over the last 5 minutes
`mdsd_cluster_replication_stuck_chunks`	Number of chunks that block replication
`mdsd_cluster_replication_touts_total`	Total number of chunks that slow down replication
`job:mdsd_fs_chunk_maps:sum`	Number of chunks in the storage cluster
`job:mdsd_fs_files:sum`	Number of files in the storage cluster
`master:mdsd_cs_status`	Chunk service status
`mdsd_cluster_free_space_bytes`	Amount of free physical space in the storage cluster
`mdsd_cluster_space_bytes`	Total amount of physical space in the storage cluster
`mdsd_is_master`	Node that runs the master metadata service
`mdsd_master_uptime`	Uptime of the master metadata service
`instance_le:rjournal_commit_duration_seconds_bucket:rate5m`	Commit latency of each metadata service over the last 5 minutes, broken down by histogram bucket
`instance_csid:csd_journal_usage_ratio:rate5m`	Percentage of free space in a chunk service journal over the last 5 minutes
`process_cpu_seconds_total`	Total CPU time consumed by a process, in seconds
`process_swap_bytes`	Amount of swap space used by a process
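
Some of these metrics are easier to interpret when combined. The sketch below derives two such values from plain numbers: the percentage of free physical space, from `mdsd_cluster_free_space_bytes` and `mdsd_cluster_space_bytes`, and the share of unaligned writes, from `fused_unaligned_writes:rate5m` and `fused_writes:rate5m`. It assumes that `fused_writes:rate5m` counts all write requests, including unaligned ones, which the table does not state. The function names are hypothetical, and the actual expressions in the alerting rules may differ.

```python
def free_space_percent(free_bytes: float, total_bytes: float) -> float:
    """Percentage of physical space that is still free in the cluster."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * free_bytes / total_bytes


def unaligned_share(unaligned_rate: float, total_rate: float) -> float:
    """Fraction of requests that are unaligned, given two per-second rates.

    Assumption: total_rate includes the unaligned requests.
    Returns 0.0 when there is no I/O at all.
    """
    if total_rate <= 0:
        return 0.0
    return unaligned_rate / total_rate


if __name__ == "__main__":
    # Illustrative values only, not measurements from a real cluster.
    tib = 1024 ** 4
    print(free_space_percent(free_bytes=2 * tib, total_bytes=10 * tib))
    print(unaligned_share(unaligned_rate=5.0, total_rate=50.0))
```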

Based on the metrics above, the following core storage alerts are generated in Prometheus:

Title	Message	Severity
Cluster has more than 5 metadata services	There are too many metadata services in the cluster. Each extra metadata service slows down operations with metadata.	warning
Node has stuck I/O requests	Some I/O requests are stuck on <node>.	critical
Cluster has blocked or slow replication	Chunk replication is blocked or too slow.	critical
Swap space is used	Swap space is used by <service_name> on <node>.	warning
Node has failed map requests	Some map requests on <node> have failed.	critical
Cluster has too many chunks	There are too many chunks in the cluster, which slows down the metadata service.	warning
Cluster has too many chunks	There are too many chunks in the cluster, which slows down the metadata service.	critical
Cluster has too many files	There are too many files in the cluster, which slows down the metadata service.	warning
Cluster has too many files	There are too many files in the cluster, which slows down the metadata service.	critical
Metadata service has high CPU usage	Metadata service on <node> has CPU usage higher than 80%. The service may be overloaded.	warning
Metadata service has high commit latency	Metadata service on <node> has the 95th percentile latency higher than 1 second.	warning
Metadata service has high commit latency	Metadata service on <node> has the 95th percentile latency higher than 5 seconds.	critical
Cluster has failed mount points	Some mount points stopped working and need to be recovered.	critical
Cluster has slow chunk services	Some chunk services experience slowdown and degrade the cluster performance.	warning
Cluster has offline chunk services	Some chunk services are offline. Check and restart them.	warning
Cluster has failed chunk services	Some chunk services have failed. It may be caused by physical drive failure.	warning
Cluster has unavailable metadata services	Some metadata services are offline or have failed. Check and restart them.	warning
Cluster has unaligned I/O writes	I/O writes are not aligned. It may be caused by a wrongly formatted disk in a virtual machine.	information
Cluster has unaligned I/O reads	I/O reads are not aligned. It may be caused by a wrongly formatted disk in a virtual machine.	information
Cluster is running out of physical space	There is little free physical space left on each storage tier.	warning
Cluster is out of physical space	There is not enough free physical space on each storage tier.	critical
Master metadata service changes too often	Master metadata service has changed more than once in 5 minutes.	warning
CS journal is running out of space	CS journal has less than 20% of free space left.	warning
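
The two commit latency alerts compare the 95th percentile of the commit duration with 1 second (warning) and 5 seconds (critical). As an illustration of how a percentile can be estimated from per-bucket rates such as `instance_le:rjournal_commit_duration_seconds_bucket:rate5m`, the sketch below mimics the linear interpolation that the Prometheus `histogram_quantile()` function applies to classic histograms. It is a simplified approximation written for this example, not the alert expression from pcs.rules. The function names and the sample bucket values are hypothetical.

```python
import math
from typing import List, Optional, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets holds (upper_bound, cumulative_rate) pairs and must include
    the +Inf bucket. Simplification: the lowest bucket is assumed to
    start at 0.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in the range (0, 1]")
    if not buckets:
        return math.nan
    buckets = sorted(buckets)
    top_bound, total = buckets[-1]
    if not math.isinf(top_bound) or total <= 0:
        return math.nan

    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls into the +Inf bucket: report the
                # highest finite upper bound, as Prometheus does.
                return prev_bound
            # Linear interpolation inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


def commit_latency_severity(p95_seconds: float) -> Optional[str]:
    """Apply the thresholds from the alert table: >1 s warning, >5 s critical."""
    if p95_seconds > 5:
        return "critical"
    if p95_seconds > 1:
        return "warning"
    return None


if __name__ == "__main__":
    # Hypothetical cumulative rates per bucket (le = 0.5 s, 1 s, 5 s, +Inf).
    sample = [(0.5, 80.0), (1.0, 90.0), (5.0, 98.0), (math.inf, 100.0)]
    p95 = histogram_quantile(0.95, sample)
    print(p95, commit_latency_severity(p95))
```

With the sample buckets, the estimated 95th percentile is about 3.5 seconds, which falls between the two thresholds and would correspond to the warning alert.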