Infrastructure alerts

Infrastructure alerts shown in the admin panel can relate to the management node and admin panel, the license and updates, cluster connectivity, or a specific node, including its disks and network interfaces.

Management node and admin panel alerts

High availability for the admin panel must be configured

Configure high availability for the admin panel in Settings > System settings > Management node high availability. Otherwise the admin panel will be a single point of failure.

Management node backup does not exist

The last management node backup has failed or does not exist!

    Management node backup is old

    Management node backup is older than <value> days.

    Management node backup is too old

    Management node backup is older than <value> days.

    Changes to the management database are not replicated

    Changes to the management database are not replicated to the node "<hostname>" because it is offline. Check the node's state and connectivity.

    Changes to the management database are not replicated

    Changes to the management database are not replicated to the node "<hostname>". Please contact the technical support.

    Management node HA has four nodes

    The management node HA configuration has four nodes. It is recommended to have three or five nodes included.
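The recommendation for three or five nodes can be illustrated with simple quorum arithmetic. The sketch below assumes a majority-quorum model, which the text above does not state explicitly; the function name `tolerated_failures` is hypothetical and not part of the product.

```python
def tolerated_failures(node_count: int) -> int:
    """Return how many nodes may fail while a majority still remains.

    Assumption: the HA group needs a strict majority of its members
    to stay operational. This model is illustrative only.
    """
    if node_count < 1:
        raise ValueError("node_count must be positive")
    majority = node_count // 2 + 1
    return node_count - majority


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(f"{n} nodes tolerate {tolerated_failures(n)} failure(s)")
```

Under this assumption, a fourth node adds no extra fault tolerance over three nodes, whereas a fifth node does.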

    Management panel SSL certificate will expire in less than 30 days

    Management panel SSL certificate will expire in less than 7 days

    Management panel SSL certificate has expired
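The three certificate alerts above correspond to three stages of remaining validity. A minimal sketch of that classification follows, assuming the alert is chosen from the certificate's expiry timestamp alone; the function name `certificate_alert` and the exact boundary handling are assumptions, not the product's implementation.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def certificate_alert(not_after: datetime, now: datetime) -> Optional[str]:
    """Pick the alert title matching the remaining certificate lifetime."""
    remaining = not_after - now
    if remaining <= timedelta(0):
        return "Management panel SSL certificate has expired"
    if remaining < timedelta(days=7):
        return "Management panel SSL certificate will expire in less than 7 days"
    if remaining < timedelta(days=30):
        return "Management panel SSL certificate will expire in less than 30 days"
    return None  # more than 30 days left: no alert


if __name__ == "__main__":
    now = datetime(2024, 1, 1, tzinfo=timezone.utc)
    print(certificate_alert(now + timedelta(days=10), now))
```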

    License alerts

    License is not loaded

    License is not updated

    License will expire soon

    License expired

    Licensed core limit exceeded

    Licensed storage capacity is low

    Licensed storage capacity is critically low

    Cluster is out of licensed space

    Update alerts

    Software updates exist

    Software updates exist for the Management panel and compute API

    Update check failed

    Multiple update checks failed

    Update download failed

    Node update failed

    Update failed

    Cluster update failed

    Entering maintenance for update failed

    Cluster connectivity alerts

    Network connectivity failed

    No network traffic has been detected via network "<network_name>" from all nodes.

    Node network connectivity problem

    Node "<hostname>" has no network connectivity to node "<hostname>" via network "<network_name>".

    Node network packet loss

    Node "<hostname>" has a problem with network connectivity to node "<hostname>" via network "<network_name>" due to the loss of some packets.

    Node network persistent packet loss

    Node "<hostname>" has a problem with network connectivity to node "<hostname>" via network "<network_name>" due to the persistent loss of some packets over the last two hours.

    Node network unstable connectivity

    Node "<hostname>" has a problem with network connectivity to node "<hostname>" via network "<network_name>" due to the loss of all MTU-sized packets.

    Node network MTU packet loss

    Node "<hostname>" has a problem with network connectivity to node "<hostname>" via network "<network_name>" due to the loss of some MTU-sized packets.

    Node network persistent MTU packet loss

    Node "<hostname>" has a problem with network connectivity to node "<hostname>" via network "<network_name>" due to the persistent loss of some MTU-sized packets over the last two hours.

    MTU mismatch

    Network <network_name> has assigned interfaces with different MTU.

    Node alerts

    Node is offline

    Node "<hostname>" is offline.

    Node got offline too many times

    Node "<hostname>" got offline too many times for the last hour.

    Kernel is outdated

    Node "<hostname>" is not running the latest kernel.

    OOM killer triggered

    OOM killer has been triggered on node "<hostname>".

    Time not synced

    Time on node "<hostname>" differs from time on backend node by more than <value> seconds.

    Node time critically unsynced

    Time on node <hostname> is critically unsynced, differing from the time on backend node by more than <value> seconds.

    Node has no internet access

    Node "<hostname>" cannot reach the repository. Ensure the node has a working internet connection.

    Incompatible hardware detected

    Incompatible hardware detected on node "<hostname>": <hardware_list>. Using Mellanox and AMD may lead to data loss. Please double check that SR-IOV is properly enabled. Visit https://support.virtuozzo.com/hc/en-us/articles/19764365143953 to learn how to troubleshoot this issue.

    Node has high CPU usage

    Node <hostname> has CPU usage higher than 90%. The current value is <value>%.

    Node has high memory usage

    Node <hostname> has memory usage higher than 95%. The current value is <value>%.

    Node has high disk I/O usage

    Disk /dev/<disk_name> on node <hostname> has I/O usage higher than 85%. The current value is <value>%.

    Node has high swap usage

    Node <hostname> has swap usage higher than 40%. The current value is <value>%.

    Node has critically high swap usage

    Node <hostname> has critically high swap usage, exceeding 80%. The current value is <value>%.
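The CPU, memory, disk I/O, and swap alerts above are all threshold checks on a percentage. The sketch below collects the documented thresholds in one table. The names `THRESHOLDS` and `usage_alert` are hypothetical, and reporting only the most severe swap alert is an assumption; the text above does not say whether both swap alerts can be active at once.

```python
from typing import Optional

# Thresholds taken from the alert descriptions; each list is ordered
# from the most severe limit to the least severe one.
THRESHOLDS = {
    "cpu": [(90.0, "Node has high CPU usage")],
    "memory": [(95.0, "Node has high memory usage")],
    "disk_io": [(85.0, "Node has high disk I/O usage")],
    "swap": [
        (80.0, "Node has critically high swap usage"),
        (40.0, "Node has high swap usage"),
    ],
}


def usage_alert(metric: str, percent: float) -> Optional[str]:
    """Return the alert title for a usage value, or None if below all limits."""
    for limit, title in THRESHOLDS[metric]:
        if percent > limit:  # "higher than": strictly greater
            return title
    return None


if __name__ == "__main__":
    print(usage_alert("swap", 85.0))
    print(usage_alert("cpu", 42.0))
```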

    Node has high receive packet loss rate

    Node <hostname> has <value> receive packet loss rate reported by job <job_name>.

    Node has high transmit packet loss rate

    Node <hostname> has <value> transmit packet loss rate reported by job <job_name>.

    Node has high receive packet error rate

    Node <hostname> has <value> receive packet error rate reported by job <job_name>.

    Node has high transmit packet error rate

    Node <hostname> has <value> transmit packet error rate reported by job <job_name>.

    Reached "node crash per hour" threshold

    Node <hostname> with shaman node ID <id> has reached the "node crash per hour" threshold.

    OOM-kill event detected

    OOM-kill event detected on node <hostname> at least once for the last 24 hours. You need to check memory consumption.

    Node failed to return to operation

    Node <hostname> has failed to automatically return to operation within 30 minutes after a crash. Check the node's hardware, and then try returning it to operation manually.

    Node crash detected

    Node <hostname> crashed, which started the VM evacuation.

    Disk alerts

    Disk SMART warning

    Disk "<disk_name>" (<serial>) on node "<hostname>" has failed a S.M.A.R.T. check.

    Disk error

    Disk "<disk_name>" (<serial>) has failed on node "<hostname>".

    Disk is running out of space

    Root partition on node "<hostname>" is running out of space (less than 10% or 5 GB of free space).

    Disk is out of space

    Root partition on node "<hostname>" is running out of space (for non-compute nodes: less than 5% of free space).

    Compute node disk is out of space

    Root partition on node "<hostname>" is running out of space (for compute nodes: less than 1 GB of free space).

    Software RAID is not fully synced

    Software RAID <disk_name> on node <hostname> is <value>% synced.

    Systemd service is flapping

    Systemd service <service_name> on node <hostname> has changed its state more than 5 times in 5 minutes or 15 times in one hour.
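The flapping condition above combines two time windows. A minimal sketch of such a check is shown below, assuming state changes are available as Unix timestamps in seconds and reading "15 times in one hour" as "more than 15"; the function name `is_flapping` is hypothetical and this is not the product's detection logic.

```python
from typing import Iterable

FIVE_MINUTES = 5 * 60
ONE_HOUR = 60 * 60


def is_flapping(change_times: Iterable[float], now: float) -> bool:
    """Return True if state changes exceed either documented limit."""
    times = list(change_times)

    def changes_within(window: float) -> int:
        # Count changes that happened in the half-open interval (now - window, now].
        return sum(1 for t in times if now - window < t <= now)

    return changes_within(FIVE_MINUTES) > 5 or changes_within(ONE_HOUR) > 15


if __name__ == "__main__":
    now = 10_000.0
    burst = [now - 10 * i for i in range(6)]  # six changes in one minute
    print(is_flapping(burst, now))
```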

    SMART Media Wearout warning

    SMART Media Wearout critical

    Network interface alerts

    Network interface half duplex

    Network interface "<iface_name>" on node "<hostname>" is not in the full duplex mode.

    Low network interface speed

    Network interface "<iface_name>" on node "<hostname>" has speed lower than the minimally required 1 Gbps.

    Network interface is flapping

    Network interface <iface_name> on node <hostname> is flapping.

    Network bond is not redundant

    Network bond <iface_name> on node <hostname> is missing <number> subordinate interface(s).

    Data-in-transit encryption alerts

    Enabling IPv6 mode takes too much time

    Enabling traffic encryption takes too much time

    System configuration is not optimal for traffic encryption

    Node IPsec certificate will expire in less than 7 days

    Node IPsec certificate has expired

    Identity provider alerts

    Identity provider connection error

    Unable to connect to identity provider "<idp_name>" in domain "<name>".

    Identity provider validation error

    Invalid identity provider configuration "<idp_name>" in domain "<name>".