Core storage alerts
Core storage alerts are generated from the metrics listed in Core storage metrics and are displayed in the admin panel.
Storage cluster alerts
-
Not enough cluster nodes
-
Cluster <cluster_name> has only {1,2} node(s) instead of the recommended minimum of 3. Add more nodes to the cluster.
-
Cluster is out of physical space
-
Cluster has just <free_space> TB (<free_space_in_percent>%) of physical storage space left. You may want to free some space or add more storage capacity.
-
Cluster is running out of physical space on tier
-
There is little free physical space left on storage tier <tier> (less than 20% of free space).
-
Cluster is out of physical space on tier
-
There is not enough free physical space on storage tier <tier> (less than 10% of free space).
-
Cluster has blocked or slow replication
-
Chunk replication is blocked or too slow.
-
Cluster has too many chunks
-
There are too many chunks in the cluster, which slows down the metadata service.
-
Cluster has critically high number of chunks
-
There are too many chunks in the cluster, which slows down the metadata service.
-
Cluster has too many files
-
There are too many files in the cluster, which slows down the metadata service.
-
Cluster has critically high number of files
-
There are too many files in the cluster, which slows down the metadata service.
-
Cluster has failed mount points
-
Some mount points stopped working and need to be recovered.
-
Cluster has unaligned I/O reads
-
I/O reads are not aligned. It may be caused by a wrongly formatted disk in a virtual machine.
-
Node has stuck I/O requests
-
Some I/O requests are stuck on <node>.
-
Node has failed map requests
-
Some map requests on <node> have failed.
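The tier space alerts above differ only in their thresholds (less than 20% and less than 10% of free space), so the relationship can be sketched as a small classifier. This is a minimal illustration, not the product's implementation: the function names, the byte-based inputs, and the choice to return the alert title as a string are all assumptions made for the example.

```python
from typing import Optional

# Thresholds come from the alert descriptions above. Everything else
# (names, inputs, return values) is hypothetical and for illustration only.
TIER_WARNING_FREE_PERCENT = 20.0   # "Cluster is running out of physical space on tier"
TIER_CRITICAL_FREE_PERCENT = 10.0  # "Cluster is out of physical space on tier"
RECOMMENDED_MIN_NODES = 3          # "Not enough cluster nodes"


def tier_space_alert(free_bytes: int, total_bytes: int) -> Optional[str]:
    """Return the title of the tier space alert that applies, or None."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    free_percent = 100.0 * free_bytes / total_bytes
    # Check the stricter threshold first so that it takes precedence.
    if free_percent < TIER_CRITICAL_FREE_PERCENT:
        return "Cluster is out of physical space on tier"
    if free_percent < TIER_WARNING_FREE_PERCENT:
        return "Cluster is running out of physical space on tier"
    return None


def node_count_alert(node_count: int) -> Optional[str]:
    """Return the node count alert title for clusters with 1 or 2 nodes."""
    if 1 <= node_count < RECOMMENDED_MIN_NODES:
        return "Not enough cluster nodes"
    return None
```

For example, `tier_space_alert(15, 100)` returns the "running out" title, while `tier_space_alert(5, 100)` returns the "out of physical space" title.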
Metadata service alerts
-
Only one metadata disk in cluster
-
Cluster has only one MDS. There is only one disk with the metadata role at the moment. Losing this disk will completely destroy all cluster data irrespective of the redundancy schema.
-
Not enough metadata disks
-
Cluster requires more disks with the metadata role. Losing one more MDS will halt cluster operation.
-
More than one metadata service per node
-
Node “<hostname>” has more than one metadata service located on it. It is recommended to have only one metadata service per node. Delete the extra metadata service(s) from this node and create them on other nodes instead.
-
Four metadata services in cluster
-
Cluster has four metadata services. This configuration slows down the cluster performance and does not improve its availability. For a cluster of four nodes, it is enough to configure three MDSes. Delete an extra MDS from one of the cluster nodes.
-
Over five metadata services in cluster
-
Cluster has more than five metadata services. This configuration slows down the cluster performance and does not improve its availability. For a large cluster, it is enough to configure five MDSes. Delete extra MDSes from the cluster nodes.
-
Metadata service has high CPU usage
-
Metadata service on <node> has CPU usage higher than 80%. The service may be overloaded.
-
Metadata service has high commit latency
-
Metadata service on <node> has the 95th percentile latency higher than 1 second.
-
Metadata service has critically high commit latency
-
Metadata service on <node> has the 95th percentile latency higher than 5 seconds.
-
Cluster has unavailable metadata services
-
Some metadata services are offline or have failed. Check and restart them.
-
Master metadata service changes too often
-
Master metadata service has changed more than once in 5 minutes.
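The numeric conditions in the metadata service alerts above can be summarized in a short sketch. It assumes the MDS count, the 95th percentile commit latency, and the timestamps of master changes are already available as plain values; all function names are hypothetical. The "Not enough metadata disks" alert is left out on purpose, because the text above does not state its threshold.

```python
from typing import Optional, Sequence

MASTER_CHANGE_WINDOW_SECONDS = 5 * 60  # "more than once in 5 minutes"


def mds_count_alert(mds_count: int) -> Optional[str]:
    """Map an MDS count to the count-based alert titles listed above."""
    if mds_count == 1:
        return "Only one metadata disk in cluster"
    if mds_count == 4:
        return "Four metadata services in cluster"
    if mds_count > 5:
        return "Over five metadata services in cluster"
    return None


def commit_latency_alert(p95_seconds: float) -> Optional[str]:
    """Classify the 95th percentile commit latency (in seconds)."""
    if p95_seconds > 5.0:
        return "Metadata service has critically high commit latency"
    if p95_seconds > 1.0:
        return "Metadata service has high commit latency"
    return None


def master_changes_too_often(change_times: Sequence[float], now: float) -> bool:
    """True if the master changed more than once in the last 5 minutes.

    change_times and now are timestamps in seconds (assumed inputs).
    """
    window_start = now - MASTER_CHANGE_WINDOW_SECONDS
    recent = [t for t in change_times if window_start <= t <= now]
    return len(recent) > 1
```

With these definitions, three or five metadata services produce no count-based alert, which matches the recommendations in the alert messages.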
Chunk service alerts
-
Not enough storage disks
-
Cluster requires more disks with the storage role to be able to provide the required level of redundancy.
-
CS has inconsistent encryption settings
-
Encryption is disabled for some CSs on tier <tier> but enabled for others on the same tier.
-
Storage disk is unresponsive
-
Disk <disk_name> (CS#<cs_id>) on node <hostname> is unresponsive. Check or replace this disk.
-
Cluster has offline chunk services
-
Some chunk services are offline. Check and restart them.
-
Cluster has failed chunk services
-
Some chunk services have failed. It may be caused by physical drive failure.
-
Number of CSes per device does not match configuration
-
Number of CSes per device on node <node> with ID <id> does not match configuration. Check your disk configuration.
-
CS has excessive journal size
-
The journal on CS#<cs_id> on host <hostname>, disk <disk_name>, is <value> MiB. The recommended size is 256 MiB.
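Two of the chunk service alerts above describe conditions that are easy to state precisely: mixed encryption settings within one tier, and a journal larger than the recommended 256 MiB. The sketch below shows one way to express them. The input shapes and function names are assumptions, and treating "excessive" as "strictly greater than 256 MiB" is also an assumption, since the exact trigger is not given above.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

RECOMMENDED_JOURNAL_MIB = 256  # from the alert message above


def tiers_with_mixed_encryption(
    chunk_services: Iterable[Tuple[int, bool]],
) -> List[int]:
    """Return tiers where some CSes have encryption on and others off.

    Each item is a hypothetical (tier, encryption_enabled) pair.
    """
    states: Dict[int, Set[bool]] = defaultdict(set)
    for tier, encrypted in chunk_services:
        states[tier].add(encrypted)
    # A tier is inconsistent when both True and False were observed.
    return sorted(tier for tier, seen in states.items() if len(seen) == 2)


def journal_is_excessive(journal_mib: float) -> bool:
    """Assumed rule: the journal exceeds the recommended size."""
    return journal_mib > RECOMMENDED_JOURNAL_MIB
```

For instance, `tiers_with_mixed_encryption([(0, True), (0, False), (1, True)])` reports only tier 0, because tier 1 has a single, consistent setting.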