Compute alerts

Compute alerts are generated from the metrics described in Compute metrics and are displayed in the admin panel.

Compute service alerts

Compute node service is down

Service <service_name> is down on host <hostname>.

Keystone service is down

OpenStack service API upstream is down

All OpenStack service upstreams are down

OpenStack Cinder Scheduler is down

OpenStack Cinder Volume agent is down

OpenStack Neutron L3 agent is down

OpenStack Neutron Open vSwitch agent is down

OpenStack Neutron Metadata agent is down

OpenStack Neutron DHCP agent is down

OpenStack Nova Compute is down

OpenStack Nova Conductor is down

OpenStack Nova Scheduler is down

OpenStack Octavia Provisioning Worker is down

OpenStack Octavia Housekeeping service is down

OpenStack Octavia HealthManager service is down

High request error rate for OpenStack API requests detected

Compute backup service degraded

Compute backup service failed

Compute cluster alerts

Compute cluster has failed

Cluster is running out of vCPU resources

Cluster is out of vCPU resources

Cluster is running out of memory

Cluster is out of memory

Virtual machine is in error state

Virtual machine state mismatch

Virtual machine is not responding

Virtual machine has crashed

Volume attachment details mismatch

Volume is stuck in transitional state

Volume has incorrect status

Virtual network DHCP check failed

Virtual DHCP port check failed

Backup plan failed

Virtual router HA has more than one active L3 agent

Virtual router HA with ID <router_id> has more than one active L3 agent. Please contact the technical support.

Virtual router HA has no active L3 agent

Virtual router HA with ID <router_id> has no active L3 agent. Please contact the technical support.

Virtual router SNAT-related port has invalid host binding

Virtual router SNAT-related port with ID <id> is bound to the Standby HA router node. Please contact the technical support.

Virtual router gateway port has invalid host binding

Virtual router gateway port with ID <id> is bound to the Standby HA router node. Please contact the technical support.

Neutron bridge mapping not found

Physical network "<physical_network>" is not found in the bridge mapping on node "<hostname>". Virtual network "<virtual_network>" on this node is most likely not functioning. Please contact the technical support.

Virtual DHCP server is unavailable from node

Built-in DHCP server for virtual network "<network_id>" is not available from node "<hostname>". Please check the neutron-dhcp-agent service or contact the technical support.

Virtual DHCP server is unavailable

Built-in DHCP server for virtual network "<network_id>" is not available from cluster nodes. Please check the neutron-dhcp-agent service or contact the technical support.

Virtual DHCP server HA degraded on node

Only one built-in DHCP server for virtual network "<network_id>" is reachable from node "<hostname>". DHCP high availability entered the degraded state. Please check the neutron-dhcp-agent service or contact the technical support.

Virtual DHCP server HA degraded

Only one built-in DHCP server for virtual network "<network_id>" is reachable from cluster nodes. DHCP high availability entered the degraded state. Please check the neutron-dhcp-agent service or contact the technical support.

Unrecognized DHCP servers detected from node

Built-in DHCP service for virtual network "<network_id>" may be malfunctioning on node "<hostname>". Please ensure that virtual machines are receiving correct DHCP addresses or contact the technical support.

Unrecognized DHCP servers detected

Built-in DHCP service for virtual network "<network_id>" may be malfunctioning. Please ensure that virtual machines are receiving correct DHCP addresses or contact the technical support.

Load balancer is stuck in pending state

Load balancer with ID "<id>" is stuck with the "<provisioning_status>" status. Ensure that the load balancer configuration is consistent and perform a failover.

Load balancer error

Load balancer with ID "<id>" has the 'ERROR' provisioning status. Please check the Octavia service logs or contact the technical support.

Kubernetes cluster update failed

Kubernetes cluster with ID "<id>" has the "<status>" status.

Compute cluster not on optimal tier

Tier <tier> is not optimal for the compute cluster, as it may cause performance issues. Use a faster storage tier to ensure stable performance.

Compute node alerts

Node is running out of vCPU resources: Node <hostname> with ID <id> has reached 80% of the vCPU allocation limit.

The compute node may soon run short of vCPU resources and be unable to accommodate new virtual machines. To avoid this, check the distribution of VMs in the compute cluster, and then migrate VMs from the specified node to less loaded compute nodes.

Node is out of vCPU resources: Node <hostname> with ID <id> has reached 95% of the vCPU allocation limit.

The compute node will soon run short of vCPU resources and be unable to accommodate new virtual machines. To avoid this, check the distribution of VMs in the compute cluster, and then migrate VMs from the specified node to less loaded compute nodes.

Node is running out of memory: Node <hostname> with ID <id> has reached 80% of the memory allocation limit.

The compute node may soon run short of RAM and be unable to accommodate new virtual machines. To avoid this, check the distribution of VMs in the compute cluster, and then migrate VMs from the specified node to less loaded compute nodes.

Node is out of memory: Node <hostname> with ID <id> has reached 95% of the memory allocation limit.

The compute node will soon run short of RAM and be unable to accommodate new virtual machines. To avoid this, check the distribution of VMs in the compute cluster, and then migrate VMs from the specified node to less loaded compute nodes.

Node is fenced for over 1 hour: Node <hostname> with ID <id> was in a fenced state for at least 1 hour over the last 2 hours. Compute cluster capacity is reduced. Check node health and return it to operation manually.

Extra RAM reservation detected for compute placement service: Extra VM registrations consuming '<value>' GiB of RAM detected for the compute placement service on node '<hostname>'.

The compute node may run short of RAM and be unable to accommodate new virtual machines. Contact the technical support team.

Extra vCPU reservation detected for compute placement service: Extra VM registrations consuming '<value>' vCPUs detected for the compute placement service on node '<hostname>'.

The compute node may run short of vCPU resources and be unable to accommodate new virtual machines. Contact the technical support team.

Extra RAM reservation detected on hypervisor node: Extra VM registrations consuming '<value>' GiB of RAM detected on hypervisor node '<hostname>'.

This issue may lead to incorrect metric values and resource calculations on the node, and thus trigger false-positive alerts. Contact the technical support team.

Extra vCPU reservation detected on hypervisor node: Extra VM registrations consuming '<value>' vCPUs detected on hypervisor node '<hostname>'.

This issue may lead to incorrect metric values and resource calculations on the node, and thus trigger false-positive alerts. Contact the technical support team.
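
The 80% and 95% allocation thresholds quoted in the node alerts can be illustrated with a short Python sketch. This is not the product's implementation: the function name, the `used`/`limit` inputs, and the choice of an inclusive comparison ("has reached" read as "greater than or equal to") are assumptions made for illustration only.

```python
from typing import Optional

# Thresholds quoted in the alert texts; everything else here is hypothetical.
RUNNING_OUT_PERCENT = 80.0
OUT_OF_PERCENT = 95.0


def classify_allocation(used: float, limit: float,
                        resource: str = "vCPU resources") -> Optional[str]:
    """Return the matching alert title for a node, or None if no alert applies."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    percent = 100.0 * used / limit
    # Check the higher threshold first so that 95% and above
    # is not reported as merely "running out".
    if percent >= OUT_OF_PERCENT:
        return f"Node is out of {resource}"
    if percent >= RUNNING_OUT_PERCENT:
        return f"Node is running out of {resource}"
    return None
```

For example, `classify_allocation(96, 100, "memory")` returns the "Node is out of memory" title, while an allocation of 40 out of 100 returns `None`.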

Domain quota alerts

Domain is low on vCPU resources: Domain <name> has reached <value_80<95>% of the vCPU allocation limit.

The domain will soon run short of vCPU resources, which will make it impossible to create new virtual machines. To avoid this, add more vCPUs to the domain quota.

Domain is out of vCPU resources: Domain <name> has reached <value_>=95>% of the vCPU allocation limit.

The domain will soon run short of vCPU resources, which will make it impossible to create new virtual machines. To avoid this, add more vCPUs to the domain quota.

Domain is low on memory: Domain <name> has reached <value_80<95>% of the memory allocation limit.

The domain will soon run short of RAM, which will make it impossible to create new virtual machines. To avoid this, add more RAM to the domain quota.

Domain is out of memory: Domain <name> has reached <value_>=95>% of the memory allocation limit.

The domain will soon run short of RAM, which will make it impossible to create new virtual machines. To avoid this, add more RAM to the domain quota.

Domain is low on storage policy space: Domain <name> has reached <value_80<95>% of the <policy_name> storage policy allocation limit.

The domain will soon run short of storage policy space, which will make it impossible to create new compute volumes with this storage policy. To avoid this, add more storage space to the domain quota.

Domain is out of storage policy space: Domain <name> has reached <value_>=95>% of the <policy_name> storage policy allocation limit.

The domain will soon run short of storage policy space, which will make it impossible to create new compute volumes with this storage policy. To avoid this, add more storage space to the domain quota.

Project quota alerts

Project is out of vCPU resources: Project <name> has reached 95% of the vCPU allocation limit.

The project will soon run short of vCPU resources, which will make it impossible to create new virtual machines. To avoid this, add more vCPUs to the project quota.

Project is out of memory: Project <name> has reached 95% of the memory allocation limit.

The project will soon run short of RAM, which will make it impossible to create new virtual machines. To avoid this, add more RAM to the project quota.

Project is out of floating IP addresses: Project <name> has reached 95% of the floating IP address allocation limit.

The project will soon run short of floating IP addresses, which will make it impossible to assign them to virtual machines. To avoid this, add more floating IPs to the project quota.

Network is out of IP addresses: Network <name> with ID <id> in project <name> has reached 95% of the IP address allocation limit.

The network will soon run short of IP addresses, which will make it impossible to connect new virtual machines to this network. To avoid this, add more allocation pools to the network.

Project is out of storage policy space: Project <name> has reached 95% of the <policy_name> storage policy allocation limit.

The project will soon run short of storage policy space, which will make it impossible to create new compute volumes with this storage policy. To avoid this, add more storage space to the project quota.

Virtual machine alerts

Virtual machine error: One or more '<reason>' errors detected for VM instance: <instance>.
Migration job conflict: Another migration job is already running for VM instance: <instance>.
Temporary snapshot exists: Different operations with VMs might be blocked because temporary snapshot live migration leftovers are detected.
Failed to power off VM: One or more failed attempts to power off VM '<instance>' are detected.
VM device or resource busy: One or more failed VM disk I/O operations detected because device is busy.
VM invalid volume exception: One or more operations with invalid volume detected for VM '<instance>'.
Conflict updating instance: One or more 'UnexpectedTaskStateError_Remote' errors detected. Conflict updating VM instance: <instance>.
QEMU error: One or more '<reason>' errors detected in <instance>.log.
Long I/O request: Detected <reason> I/O on node '<hostname>' for VM <name> (<vm_uuid>).
VM has high receive packet drop rate: VM "<name>" network interface with MAC "<mac_address>" has high receive packet drop rate (<value> packets/s). This may indicate network congestion, high VM load, excessive traffic from other workloads, or abnormal network traffic. Check VM load, network settings, and virtual network configuration.
VM has high transmit packet drop rate: VM "<name>" network interface with MAC "<mac_address>" has high transmit packet drop rate (<value> packets/s). This may indicate network congestion, high VM load, excessive traffic from other workloads, or abnormal network traffic. Check VM load, network settings, and virtual network configuration.

Other alerts

Libvirt service is down

Virtualization service (libvirt) error

Errors in libvirt '<reason>' were detected on <hostname>. Managing VMs on this node will not be possible and VMs might become unavailable. Restart libvirtd.service on the node. If the issue persists, collect /var/log/libvirt* logs and contact technical support.

Libvirt service may have file descriptor leak

Libvirt service on <hostname> is using more file descriptors than expected for the number of running VMs, which may indicate a resource leak. This can affect stability and lead to failures when opening new files or connections. Restart the service, and if the issue persists, contact the technical support.
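
When investigating this alert, it can help to see how many file descriptors a process actually holds. The following Python sketch counts them on Linux by listing `/proc/<pid>/fd`. The helper names are hypothetical, the "expected" number of descriptors per VM used by the alert is not stated here, and reading another process's descriptor directory (such as that of the libvirt daemon) normally requires root privileges.

```python
import os


def count_open_fds(pid: int) -> int:
    """Count the file descriptors currently open in a process (Linux only)."""
    # Each entry in /proc/<pid>/fd corresponds to one open descriptor.
    return len(os.listdir(f"/proc/{pid}/fd"))


def fds_per_vm(fd_count: int, running_vms: int) -> float:
    """Average descriptors per running VM; a node with no VMs is treated as one."""
    return fd_count / max(running_vms, 1)
```

A ratio from `fds_per_vm` that keeps growing while the number of running VMs stays the same would be consistent with the leak this alert describes.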

Docker service is down

RabbitMQ node is down

One or more nodes in the RabbitMQ cluster are down.

RabbitMQ cluster has split brain

RabbitMQ cluster has split brain due to a network partition. HA failover is not functional. Compute service requests may fail. Contact technical support immediately.

RabbitMQ node disk space is low

RabbitMQ node <hostname> is running low on disk space (<value>B available). Consider cleaning up old data or expanding the storage soon.

RabbitMQ node is out of disk space

RabbitMQ on <hostname> has <value>B of system disk space remaining. If the disk fills up, RabbitMQ will crash. HA failover and compute messaging will stop working, compute service requests may fail. Free up disk space or expand the system volume immediately.

PostgreSQL database size is greater than 30 GB

PostgreSQL database "<name>" on node "<hostname>" is greater than 30 GB in size. Verify that deleted entries are archived or contact the technical support.

PostgreSQL database uses more than 50% of node root partition

PostgreSQL databases on node "<hostname>" with ID "<id>" use more than 50% of node root partition. Verify that deleted entries are archived or contact the technical support.

Kafka SSL CA certificate will expire in less than 30 days

Kafka SSL CA certificate will expire in <number> days. Please renew the certificate.

Kafka CA certificate expired

Kafka CA certificate has expired. Kafka connectivity is disrupted. Renew the certificate immediately.

Kafka SSL client certificate will expire in less than 30 days

Kafka SSL client certificate will expire in <number> days. Please renew the certificate.

Kafka client certificate expired

Kafka client certificate has expired. Kafka connectivity is disrupted. Renew the certificate immediately.
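
The relationship between the "will expire in less than 30 days" and "expired" certificate alerts can be sketched in Python as follows. This is a minimal illustration, not the product's check: the function name and its parameters are assumptions, and the certificate's expiry time (`not_after`) is expected to be obtained separately.

```python
from datetime import datetime, timezone
from typing import Optional

WARNING_WINDOW_DAYS = 30  # from the "less than 30 days" alert titles


def certificate_alert(not_after: datetime, now: datetime,
                      kind: str = "CA") -> Optional[str]:
    """Return the matching Kafka certificate alert title, or None."""
    remaining = not_after - now
    if remaining.total_seconds() <= 0:
        return f"Kafka {kind} certificate expired"
    # timedelta.days counts whole days, so 29 days 23 hours still triggers the warning.
    if remaining.days < WARNING_WINDOW_DAYS:
        return f"Kafka SSL {kind} certificate will expire in less than 30 days"
    return None
```

For example, with `now` set to 1 January 2024 (UTC), a CA certificate that expires on 11 January 2024 yields the warning title, and one that expires on 1 March 2024 yields `None`.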