Compute alerts

Compute alerts are generated from the metrics described in Compute metrics and are displayed in the admin panel.

Compute service alerts

Compute node service is down

Service <service_name> is down on host <hostname>.

Keystone service is down

OpenStack service API upstream is down

All OpenStack service upstreams are down

OpenStack Cinder Scheduler is down

OpenStack Cinder Volume agent is down

OpenStack Neutron L3 agent is down

OpenStack Neutron Open vSwitch agent is down

OpenStack Neutron Metadata agent is down

OpenStack Neutron DHCP agent is down

OpenStack Nova Compute is down

OpenStack Nova Conductor is down

OpenStack Nova Scheduler is down

OpenStack Octavia Provisioning Worker is down

OpenStack Octavia Housekeeping service is down

OpenStack Octavia HealthManager service is down

High request error rate for OpenStack API requests detected

Compute backup service degraded

Compute backup service failed

Compute cluster alerts

Compute cluster has failed

Cluster is running out of vCPU resources

Cluster is out of vCPU resources

Cluster is running out of memory

Cluster is out of memory

Virtual machine is in error state

Virtual machine state mismatch

Virtual machine is not responding

Virtual machine has crashed

Volume attachment details mismatch

Volume is stuck in transitional state

Volume has incorrect status

Virtual network DHCP check failed

Virtual DHCP port check failed

Backup plan failed

Virtual router HA has more than one active L3 agent

Virtual router HA with ID <router_id> has more than one active L3 agent. Please contact the technical support.

Virtual router HA has no active L3 agent

Virtual router HA with ID <router_id> has no active L3 agent. Please contact the technical support.
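The two router HA alerts above depend on how many L3 agents report the router as active. The following is a minimal sketch of that condition, not the product's actual check. The function name router_ha_alert and the "active"/"standby" state strings are assumptions made for illustration.

```python
def router_ha_alert(router_id, agent_states):
    """Return an alert message for an HA router, or None if it is healthy.

    `agent_states` is a hypothetical list with one HA state string per
    L3 agent hosting the router, for example ["active", "standby"].
    """
    # Count the agents that currently consider themselves active.
    active = sum(1 for state in agent_states if state == "active")
    if active > 1:
        return f"Virtual router HA with ID {router_id} has more than one active L3 agent."
    if active == 0:
        return f"Virtual router HA with ID {router_id} has no active L3 agent."
    # Exactly one active agent is the expected state.
    return None
```

A healthy HA router has exactly one active agent, so any other count maps to one of the two alerts.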

Virtual router SNAT-related port has invalid host binding

Virtual router SNAT-related port with ID <id> is bound to the Standby HA router node. Please contact the technical support.

Virtual router gateway port has invalid host binding

Virtual router gateway port with ID <id> is bound to the Standby HA router node. Please contact the technical support.

Neutron bridge mapping not found

Physical network "<physical_network>" is not found in the bridge mapping on node "<hostname>". Virtual network "<virtual_network>" on this node is most likely not functioning. Please contact the technical support.

Virtual DHCP server is unavailable from node

Built-in DHCP server for virtual network "<network_id>" is not available from node "<hostname>". Please check the neutron-dhcp-agent service or contact the technical support.

Virtual DHCP server is unavailable

Built-in DHCP server for virtual network "<network_id>" is not available from cluster nodes. Please check the neutron-dhcp-agent service or contact the technical support.

Virtual DHCP server HA degraded on node

Only one built-in DHCP server for virtual network "<network_id>" is reachable from node "<hostname>". DHCP high availability entered the degraded state. Please check the neutron-dhcp-agent service or contact the technical support.

Virtual DHCP server HA degraded

Only one built-in DHCP server for virtual network "<network_id>" is reachable from cluster nodes. DHCP high availability entered the degraded state. Please check the neutron-dhcp-agent service or contact the technical support.

Unrecognized DHCP servers detected from node

Built-in DHCP service for virtual network "<network_id>" may be malfunctioning on node "<hostname>". Please ensure that virtual machines are receiving correct DHCP addresses or contact the technical support.

Unrecognized DHCP servers detected

Built-in DHCP service for virtual network "<network_id>" may be malfunctioning. Please ensure that virtual machines are receiving correct DHCP addresses or contact the technical support.

Load balancer is stuck in pending state

Load balancer with ID "<id>" is stuck with the "<provisioning_status>" status. Ensure that the load balancer configuration is consistent and perform a failover.

Load balancer error

Load balancer with ID "<id>" has the 'ERROR' provisioning status. Please check the Octavia service logs or contact the technical support.

Kubernetes cluster update failed

Kubernetes cluster with ID "<id>" has the "<status>" status.

Compute cluster not on optimal tier

Tier <tier> is not optimal for the compute cluster, as it may cause performance issues. Use a faster storage tier to ensure stable performance.

Compute node alerts

Node is running out of vCPU resources

Node is out of vCPU resources

Node is running out of memory

Node is out of memory

Node is fenced for over 1 hour

Node <hostname> with ID <id> was in a fenced state for at least 1 hour over the last 2 hours. Compute cluster capacity is reduced. Check node health and return it to operation manually.
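The condition "fenced for at least 1 hour over the last 2 hours" counts fenced time inside a window, so it does not require one uninterrupted hour. The following is a minimal sketch of that reading. The function name, the boolean sample format, and the one-minute sampling interval are assumptions, not details of the actual alert rule.

```python
from datetime import timedelta

def fenced_for_over_an_hour(samples,
                            sample_interval=timedelta(minutes=1),
                            threshold=timedelta(hours=1)):
    """Return True if the fenced samples add up to at least `threshold`.

    `samples` is a hypothetical sequence of booleans covering the last
    2 hours, one per `sample_interval`; True means the node was fenced.
    """
    # Total fenced time is the number of fenced samples times the interval.
    fenced_time = sum(1 for fenced in samples if fenced) * sample_interval
    return fenced_time >= threshold
```

With one-minute samples, a node that was fenced for two separate 30-minute periods within the window would still meet the threshold.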

Extra RAM reservation detected for compute placement service

Extra vCPU reservation detected for compute placement service

Extra RAM reservation detected on hypervisor node

Extra vCPU reservation detected on hypervisor node

Domain quota alerts

Domain is low on vCPU resources

Domain is out of vCPU resources

Domain is low on memory

Domain is out of memory

Domain is low on storage policy space

Domain is out of storage policy space

Project quota alerts

Project is out of vCPU resources

Project is out of memory

Project is out of floating IP addresses

Network is out of IP addresses

Project is out of storage policy space

Virtual machine alerts

Virtual machine error

One or more '<reason>' errors detected for VM instance: <instance>.

Migration job conflict

Another migration job is already running for VM instance: <instance>.

Temporary snapshot exists

Some VM operations might be blocked because temporary snapshots left over from a live migration were detected.

Failed to power off VM

One or more failed attempts to power off VM '<instance>' are detected.

VM device or resource busy

One or more failed VM disk I/O operations detected because device is busy.

VM invalid volume exception

One or more operations with invalid volume detected for VM '<instance>'.

Conflict updating instance

One or more 'UnexpectedTaskStateError_Remote' errors detected. Conflict updating VM instance: <instance>.

QEMU error

One or more '<reason>' errors detected in <instance>.log.

Long I/O request

Detected <reason> I/O on node '<hostname>' for VM <name> (<vm_uuid>).

VM has high receive packet drop rate

VM "<name>" network interface with MAC "<mac_address>" has high receive packet drop rate (<value> packets/s). This may indicate network congestion, high VM load, excessive traffic from other workloads, or abnormal network traffic. Check VM load, network settings, and virtual network configuration.

VM has high transmit packet drop rate

VM "<name>" network interface with MAC "<mac_address>" has high transmit packet drop rate (<value> packets/s). This may indicate network congestion, high VM load, excessive traffic from other workloads, or abnormal network traffic. Check VM load, network settings, and virtual network configuration.

Other alerts

Libvirt service is down

Virtualization service (libvirt) error

Errors in libvirt '<reason>' were detected on <hostname>. Managing VMs on this node will not be possible and VMs might become unavailable. Restart libvirtd.service on the node. If the issue persists, collect /var/log/libvirt* logs and contact technical support.

Libvirt service may have file descriptor leak

Libvirt service on <hostname> is using more file descriptors than expected for the number of running VMs, which may indicate a resource leak. This can affect stability and lead to failures when opening new files or connections. Restart the service, and if the issue persists, contact the technical support.

Docker service is down

RabbitMQ node is down

One or more nodes in the RabbitMQ cluster are down.

RabbitMQ cluster has split brain

RabbitMQ cluster has split brain due to a network partition. HA failover is not functional. Compute service requests may fail. Contact technical support immediately.

RabbitMQ node disk space is low

RabbitMQ node <hostname> is running low on disk space (<value>B available). Consider cleaning up old data or expanding the storage soon.

RabbitMQ node is out of disk space

RabbitMQ on <hostname> has <value>B of system disk space remaining. If the disk fills up, RabbitMQ will crash. HA failover and compute messaging will stop working, and compute service requests may fail. Free up disk space or expand the system volume immediately.

PostgreSQL database size is greater than 30 GB

PostgreSQL database "<name>" on node "<hostname>" is greater than 30 GB in size. Verify that deleted entries are archived or contact the technical support.

PostgreSQL database uses more than 50% of node root partition

PostgreSQL databases on node "<hostname>" with ID "<id>" use more than 50% of node root partition. Verify that deleted entries are archived or contact the technical support.
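The two PostgreSQL alerts above use different thresholds: a 30 GB limit per database and a 50% share of the node root partition for all databases together. The following is a minimal sketch of both checks, assuming the sizes are already known in bytes. The function name and input format are hypothetical. The source does not say whether GB is binary or decimal, so the binary unit used here is also an assumption.

```python
GB = 1024 ** 3  # assumption: binary gigabytes; the source does not specify

def postgresql_alerts(db_sizes, root_partition_bytes):
    """Return the alert titles that apply to one node.

    `db_sizes` is a hypothetical mapping of database name to size in bytes;
    `root_partition_bytes` is the total size of the node root partition.
    """
    alerts = []
    # Per-database check against the fixed 30 GB limit.
    for name, size in db_sizes.items():
        if size > 30 * GB:
            alerts.append(f'PostgreSQL database "{name}" is greater than 30 GB')
    # Combined check: all databases against half of the root partition.
    if sum(db_sizes.values()) > 0.5 * root_partition_bytes:
        alerts.append("PostgreSQL database uses more than 50% of node root partition")
    return alerts
```

The two checks are independent, so a node with a small root partition can trigger the second alert without any single database exceeding 30 GB.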

Kafka SSL CA certificate will expire in less than 30 days

Kafka SSL CA certificate will expire in <number> days. Please renew the certificate.

Kafka CA certificate expired

Kafka CA certificate has expired. Kafka connectivity is disrupted. Renew the certificate immediately.

Kafka SSL client certificate will expire in less than 30 days

Kafka SSL client certificate will expire in <number> days. Please renew the certificate.

Kafka client certificate expired

Kafka client certificate has expired. Kafka connectivity is disrupted. Renew the certificate immediately.
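The Kafka certificate alerts above distinguish a certificate that will expire in less than 30 days from one that has already expired. The following is a minimal sketch of that classification, assuming the certificate expiry time is already available as a timezone-aware datetime. The function name and the returned labels are hypothetical.

```python
from datetime import datetime, timezone

WARNING_DAYS = 30  # threshold named in the alert titles

def certificate_alert(not_after, now=None):
    """Classify a certificate by its expiry time.

    `not_after` is the timezone-aware expiry time of the certificate.
    Returns "expired", "expiring", or None when no alert applies.
    """
    now = now or datetime.now(timezone.utc)
    remaining = not_after - now
    if remaining.total_seconds() <= 0:
        return "expired"
    # `remaining.days` is the number of whole days left.
    if remaining.days < WARNING_DAYS:
        return "expiring"
    return None
```

The same function could be applied to both the CA certificate and the client certificate, because both use the same 30-day threshold.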