Infrastructure alerts
The following infrastructure alerts are generated and displayed in the admin panel:
| Title | Message | Severity |
|---|---|---|
| License alerts | ||
| License is not loaded | License is not installed. | warning |
| License expired | The license of cluster “<cluster_name>” has expired. Contact your reseller to update your license immediately! | critical |
| Cluster alerts | ||
| Cluster is out of space | Cluster has just <free_space> TB (<free_space_in_percent>%) of physical storage space left. You may want to free some space or add more storage capacity. | warning |
| | Cluster “<cluster_name>” has run out of storage space allowed by license. No more data can be written. Please contact your reseller to update your license immediately! | warning |
| Licensed storage capacity is low | Cluster has reached 80% of licensed storage capacity. | warning |
| Licensed storage capacity is critically low | Cluster has reached 90% of licensed storage capacity. | critical |
| Not enough cluster nodes | Cluster “<cluster_name>” has only {1,2} node(s) instead of the recommended minimum of 3. Add {2,1} or more nodes to the cluster. | warning |
| High availability for the admin panel must be configured | Configure high availability for the admin panel in Settings > Management node. Otherwise the admin panel will be a single point of failure. | critical |
| Management node backup does not exist | Management node backup is older than <number_of_days> days. | critical |
| | The last management node backup has failed, does not exist, or is too old. | critical |
| Changes to the management database are not replicated | Changes to the management database are not replicated to the node "<hostname>" because it is offline. Check the node's state and connectivity. | critical |
| | Changes to the management database are not replicated to the node "<hostname>". Please contact the technical support. | critical |
| Cluster connectivity alerts | ||
| Cluster network connectivity problem | All nodes have network connectivity problems: unstable connectivity via network "<network_name>" due to packet loss. | critical |
| | All nodes have network connectivity problems: no connectivity via network "<network_name>". | critical |
| Node network connectivity problem | Node "<hostname>" has network connectivity problems: unstable connectivity via network "<network_name>" due to the loss of all MTU-sized packets. | critical |
| | Node "<hostname>" has network connectivity problems: unstable connectivity via network "<network_name>" due to the loss of some MTU-sized packets. | critical |
| | Node "<hostname>" has network connectivity problems: unstable connectivity via network "<network_name>" due to packet loss. | critical |
| | Node "<hostname>" has network connectivity problems: no connectivity to node "<hostname>" with interface "<iface>" via interface "<iface>". | critical |
| | Node "<hostname>" has network connectivity problems: unstable connectivity to node "<hostname>" with interface "<iface>" via interface "<iface>" due to the loss of all MTU-sized packets. | critical |
| | Node "<hostname>" has network connectivity problems: unstable connectivity to node "<hostname>" with interface "<iface>" via interface "<iface>" due to packet loss. | critical |
| | Node "<hostname>" has network connectivity problems: unstable connectivity to node "<hostname>" with interface "<iface>" via interface "<iface>" due to the loss of some MTU-sized packets. | critical |
| MTU mismatch | Some interfaces have MTU that differs from other interfaces in the same network: network "<network_name>" interface@host "<iface>@<hostname>". | critical |
| Node alerts | ||
| Node is offline | Node “<hostname>” is offline. | warning |
| Node got offline too many times | Node “<hostname>” got offline too many times last hour. | warning |
| Kernel is outdated | Node “<hostname>” is not running the latest kernel. | warning |
| OOM killer triggered | OOM killer has been triggered on node “<hostname>”. | warning |
| Time is not synced | Time on node “<hostname>” differs from time on backend node by more than 5 seconds. | warning |
| No Internet access | Cluster node <hostname> cannot reach the repository. Make sure that all cluster nodes have Internet access. | warning |
| Incompatible hardware detected | Incompatible hardware detected on node "<hostname>": <hardware_list>. Using Mellanox and AMD may lead to data loss. Please double check that SR-IOV is properly enabled. | critical |
| Swap space is running low | <swap_proportion>% of swap is used on node "<hostname>". | critical |
| Node has high CPU usage | Node <hostname> has CPU usage higher than 90%. The current value is <value>%. | warning |
| Node has high memory usage | Node <hostname> has memory usage higher than 95%. The current value is <value>%. | warning |
| Node has high disk I/O usage | Disk /dev/<disk_name> on node <hostname> has I/O usage higher than 85%. The current value is <value>%. | warning |
| Node has high receive packet loss rate | Node <hostname> has <value> receive packet loss rate reported by job <job_name>. | warning |
| Node has high transmit packet loss rate | Node <hostname> has <value> transmit packet loss rate reported by job <job_name>. | warning |
| Node has high receive packet error rate | Node <hostname> has <value> receive packet error rate reported by job <job_name>. | warning |
| Node has high transmit packet error rate | Node <hostname> has <value> transmit packet error rate reported by job <job_name>. | warning |
| Disk alerts | ||
| S.M.A.R.T. warning | Disk “<disk_name>” (<serial>) on node “<hostname>” has failed a S.M.A.R.T. check. | critical |
| Disk error | Disk “<disk_name>” (<serial>) failed on node “<hostname>”. | critical |
| Disk is out of space | Root partition on node “<hostname>” is running out of space (less than 10% of free space). | warning |
| Disk write cache is enabled | Disk write cache is enabled for disk “<disk_name>” on node “<hostname>”. Disable it to avoid potential data loss in case of a power outage. | warning |
| Disk write cache status unknown | Cannot determine the status of write cache for disk “<disk_name>” on node “<hostname>”. | warning |
| Software RAID is not fully synced | Software RAID <disk_name> on node <hostname> is <value>% synced. | warning |
| Systemd service is flapping | Systemd service <service_name> on node <hostname> has changed its state more than 5 times in 5 minutes or 15 times in one hour. | critical |
| Network alerts | ||
| Network warning | Network interface “<iface_name>” has incorrect settings: <duplex> duplex and <speed> speed. | warning |
| | Network interface “<iface_name>” on node “<hostname>” is missing important features (or has them disabled): “<feature_name>”. | warning |
| | Network interface “<iface_name>” on node “<hostname>” is not in the full duplex mode. | warning |
| | Network interface “<iface_name>” on node “<hostname>” has speed lower than the minimally required 1 Gbps. | warning |
| | Network interface “<iface_name>” on node “<hostname>” has an undefined speed. | warning |
| Network interface is flapping | Network interface <iface_name> on node <hostname> is flapping. | warning |
| Network bond is not redundant | Network bond <iface_name> on node <hostname> is missing <number_of_ifaces> subordinate interface(s). | critical |
| Update alerts | ||
| Software updates exist | Software updates exist for the node <hostname>. Current version: <current_version>. Available version: <available_version>. | information |
| Update check failed | Update check failed on the node <hostname>. Please check access to the update repository. | warning |
| Multiple update checks failed | Update checks failed multiple times on the node <hostname>. Please check access to the update repository. | critical |
| Update download failed | Update download failed on the node <hostname>. | critical |
| Node update failed | Software update failed on the node <hostname>. | critical |
| Update failed | Update failed for the management panel and compute API. | critical |
| Cluster update failed | Update failed for the cluster. | critical |
| Entering maintenance for update failed | Entering maintenance failed while updating the node <hostname>. | critical |
| Service alerts | ||
| Compute cluster has failed | Compute cluster has failed. Unable to manage virtual machines. | critical |
| Certificate expiration | Acronis Backup Gateway certificate has expired. All backup operations have been stopped. Update the certificate on the Backup Gateway screen. | critical |
| | Acronis Backup Gateway certificate will expire soon. Update the certificate on the Backup Gateway screen. | warning |
| | Acronis Backup Gateway certificate will expire on "<expiration_date>". Update the certificate on the Backup Gateway screen. | warning |
| Redundancy warning | iSCSI LUN <lun_id> of target group “<target_group>” is set to failure domain “disk” even though <number_of_nodes> nodes are available. It is recommended to set the failure domain to “host” so that the LUN can survive host failures in addition to disk failures. | warning |
| iSCSI major upgrade failed | iSCSI major upgrade failed. Will be retried… | critical |
| NFS service has unavailable FS services | Some File services are not running on <node>. Check the service status in the command-line interface. | warning |
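
Several alerts in the table are driven by fixed thresholds, for example the 80% and 90% licensed capacity levels and the "more than 5 times in 5 minutes or 15 times in one hour" rule for a flapping systemd service. The following Python sketch only illustrates how such thresholds map to titles and severities. It is not the product's implementation: the names `Alert`, `licensed_capacity_alert`, and `service_is_flapping` are hypothetical, and it assumes that "reached" means "greater than or equal to" and that "more than" applies to both the 5-minute and the one-hour limit.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Alert:
    """Hypothetical container mirroring the table columns."""
    title: str
    message: str
    severity: str


def licensed_capacity_alert(used_fraction: float) -> Optional[Alert]:
    """Map licensed capacity usage (0.0 to 1.0) to a table row, if any.

    Assumption: "has reached N%" is treated as usage >= N%.
    """
    if used_fraction >= 0.90:
        return Alert(
            "Licensed storage capacity is critically low",
            "Cluster has reached 90% of licensed storage capacity.",
            "critical",
        )
    if used_fraction >= 0.80:
        return Alert(
            "Licensed storage capacity is low",
            "Cluster has reached 80% of licensed storage capacity.",
            "warning",
        )
    return None


def service_is_flapping(change_times: Sequence[float], now: float) -> bool:
    """Return True if state changes exceed 5 in 5 minutes or 15 in one hour.

    change_times and now are timestamps in seconds (assumed units).
    """
    last_5_min = sum(1 for t in change_times if 0 <= now - t <= 300)
    last_hour = sum(1 for t in change_times if 0 <= now - t <= 3600)
    return last_5_min > 5 or last_hour > 15
```

For instance, `licensed_capacity_alert(0.85)` returns the warning-level row, while `licensed_capacity_alert(0.5)` returns `None` because no alert applies.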