Core storage metrics

The metrics used to monitor core storage are defined in Prometheus recording rules. The rules are stored in the following files, which are available on any node in the cluster:

/var/lib/prometheus/rules/mdsd.rules
/var/lib/prometheus/rules/csd.rules
/var/lib/prometheus/rules/fused.rules
/var/lib/prometheus/rules/rjournal.rules
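
To see which recording rules each of these files contributes without opening the files by hand, one option is the Prometheus HTTP API endpoint `/api/v1/rules`. The sketch below groups recording rule names by the file that defines them. The address `http://localhost:9090` and the helper names `recording_rules_by_file` and `fetch_rules` are assumptions for illustration, not part of the product.

```python
import json
import urllib.request
from collections import defaultdict

# Assumption: Prometheus listens on this address; adjust for your cluster.
PROMETHEUS_URL = "http://localhost:9090"


def recording_rules_by_file(rules_response):
    """Group recording rule names by the rule file that defines them.

    `rules_response` is the parsed JSON body of GET /api/v1/rules.
    """
    by_file = defaultdict(list)
    for group in rules_response.get("data", {}).get("groups", []):
        for rule in group.get("rules", []):
            # Alerting rules are skipped; only recording rules are collected.
            if rule.get("type") == "recording":
                by_file[group.get("file", "")].append(rule["name"])
    return dict(by_file)


def fetch_rules(base_url=PROMETHEUS_URL):
    """Fetch all loaded rules from Prometheus (requires network access)."""
    with urllib.request.urlopen(f"{base_url}/api/v1/rules") as resp:
        return json.load(resp)
```

On a node where Prometheus is reachable, `recording_rules_by_file(fetch_rules())` would return a dictionary keyed by rule file path.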

The metrics used to generate core storage alerts are also used in the alerting rules in /var/lib/prometheus/alerts/pcs.rules. These metrics are described in the following table:

Metric	Description
`fused_stuck_reqs_30s`	Number of I/O requests on a node that have been stuck for more than 30 seconds
`fused_stuck_reqs_10s`	Number of I/O requests on a node that have been stuck for more than 10 seconds
`fused_maps_failed`	Number of failed map requests on a node
`fused_map_failures_total`	Total number of failed map requests on a node
`fused_unaligned_writes:rate5m`	Number of unaligned write requests per second, averaged over 5 minutes
`fused_writes:rate5m`	Number of write requests per second, averaged over 5 minutes
`fused_unaligned_reads:rate5m`	Number of unaligned read requests per second, averaged over 5 minutes
`fused_reads:rate5m`	Number of read requests per second, averaged over 5 minutes
`mdsd_cluster_replication_stuck_chunks`	Number of chunks that block replication
`mdsd_cluster_replication_touts_total`	Total number of chunks that slow down replication
`mdsd_fs_chunk_maps`	Number of chunks in the storage cluster
`mdsd_fs_files`	Number of user-visible files in the storage cluster
`mdsd_fs_file_nodes`	Total number of files in the storage cluster
`master:mdsd_cs_status`	Chunk service status
`mdsd_cluster_free_space_bytes`	Amount of free physical space in the storage cluster
`mdsd_cluster_space_bytes`	Total amount of physical space in the storage cluster
`mdsd_is_master`	Node that runs the master metadata service
`mdsd_master_uptime`	Uptime of the master metadata service
`instance_le:rjournal_commit_duration_seconds_bucket:rate5m`	Commit latency of a particular metadata service over 5 minutes, for each histogram bucket
`instance_csid:csd_journal_usage_ratio:rate5m`	Percentage of free space in a chunk service journal over 5 minutes
`process_cpu_seconds_total`	Total CPU time consumed by a process, in seconds
`process_swap_bytes`	Amount of swap space used by a process
`storage_policy_allocatable_space`	Amount of allocatable space per storage policy
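
Several of these metrics are most useful in combination. The sketch below derives two values from them: the percentage of free physical space, and the share of write requests that are unaligned. It assumes that `fused_writes:rate5m` counts all writes, including unaligned ones, and that both series in each PromQL expression carry matching labels. The function and constant names are illustrative.

```python
def free_space_percent(free_bytes, total_bytes):
    """Return free physical space as a percentage of total space.

    Inputs correspond to mdsd_cluster_free_space_bytes and
    mdsd_cluster_space_bytes.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * free_bytes / total_bytes


def unaligned_share(unaligned_rate, total_rate):
    """Return the fraction of requests that are unaligned (0.0 when idle).

    Assumption: total_rate includes the unaligned requests.
    """
    if total_rate <= 0:
        return 0.0
    return unaligned_rate / total_rate


# The same calculations written as PromQL expressions (assumed, not
# taken from the shipped rule files).
FREE_SPACE_QUERY = (
    "100 * mdsd_cluster_free_space_bytes / mdsd_cluster_space_bytes"
)
UNALIGNED_WRITES_QUERY = "fused_unaligned_writes:rate5m / fused_writes:rate5m"
```

For example, a cluster with 25 TB free out of 100 TB yields `free_space_percent(25e12, 100e12)`, that is, 25 percent free.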

The Prometheus recording rules also include metrics that are used for monitoring the following processes:

Replication is the process of restoring data redundancy.
Re-encoding is the process of changing the redundancy of files that use erasure coding.
Rebalancing is the process of moving data from one place to another.

These metrics are described in the following table:

Metric	Description
`mdsd_cluster_to_replicate_chunks`	Number of chunks that need to be replicated
`mdsd_cluster_replicated_chunks`	Total number of replicated chunks
`mdsd_cluster_replication_touts`	Total number of timed-out replications
`mdsd_cluster_replication_stuck_chunks`	Number of chunks whose last replication attempt failed
`mdsd_cluster_rebalance_pending_chunks`	Number of chunks that need to be rebalanced
`mdsd_enc_pending_files`	Number of files pending re-encoding
`mdsd_enc_pending_bytes`	Estimated physical size of files to be re-encoded (excluding punch-holed data)
`mdsd_enc_pending_raw`	Estimated physical size of files to be re-encoded as a sum of sizes of involved chunks
`fused_ls_gc_reencoding_chunks`	Number of chunks currently being re-encoded
`fused_ls_gc_reencoded_bytes`	Total amount of re-encoded data, in bytes
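
As a minimal sketch of how the pending-work metrics above might be read together, the following function reports which background processes still have a backlog. The `samples` dictionary stands in for current metric values obtained from Prometheus; the mapping and the function name are illustrative assumptions.

```python
# Metric that indicates outstanding work for each background process.
PENDING_METRICS = {
    "replication": "mdsd_cluster_to_replicate_chunks",
    "rebalancing": "mdsd_cluster_rebalance_pending_chunks",
    "re-encoding": "mdsd_enc_pending_files",
}


def pending_work(samples):
    """Return {process: pending count} for processes with a backlog.

    `samples` maps metric names to their current numeric values;
    metrics missing from the mapping are treated as zero.
    """
    result = {}
    for process, metric in PENDING_METRICS.items():
        value = samples.get(metric, 0)
        if value > 0:
            result[process] = value
    return result
```

An empty result means that none of the three processes has pending chunks or files according to the supplied values.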