Compute alerts
Compute alerts are generated from the metrics described in Compute metrics and displayed in the admin panel.
Compute service alerts
-
Keystone API service is down
-
<service_name> API service is down.
- Check the status of the vstorage-ui-keystone-admin and vstorage-ui-keystone-public services on the management node by running:
# systemctl status vstorage-ui-keystone-admin.service
# systemctl status vstorage-ui-keystone-public.service
- Check the service logs at /var/log/vstorage-ui-backend/uwsgi-keystone-admin.log and /var/log/vstorage-ui-backend/uwsgi-keystone-public.log.
- If you cannot troubleshoot the problem, contact the technical support team.
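The two status checks above can be combined into one loop. The following is a minimal sketch, not part of the product tooling: the function name check_keystone_services is hypothetical, and it assumes you run it as root on the management node. It relies only on systemctl is-active and prints the log path from the step above for any unit that is not active.

```shell
# Hypothetical helper: report the state of both Keystone units named above.
check_keystone_services() {
    rc=0
    for role in admin public; do
        unit="vstorage-ui-keystone-${role}.service"
        if systemctl is-active --quiet "$unit"; then
            echo "$unit: active"
        else
            # Point at the matching log file listed in the troubleshooting steps
            echo "$unit: NOT active, see /var/log/vstorage-ui-backend/uwsgi-keystone-${role}.log"
            rc=1
        fi
    done
    return "$rc"
}
```

To use it, run check_keystone_services after pasting the function into a root shell. A non-zero exit status means that at least one of the two units is not active.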
-
-
OpenStack service API upstream is down
-
One or more OpenStack <service_name> API upstreams are down.
- Ensure that the container of the affected compute service is up on the management node. For example, for the nova_api service, run:
# docker ps --all | grep nova_api
- If the container is down, start it. For example, for the nova_api service, run:
# docker start nova_api
- Check the service log (refer to Viewing cluster logs).
- If you cannot troubleshoot the problem, contact the technical support team.
-
-
All OpenStack service API upstreams are down
-
All OpenStack <service_name> API upstreams are down.
- Ensure that the containers of the affected compute services are up on the management node. For example, for the nova_api service, run:
# docker ps --all | grep nova_api
- If the containers are down, start them. For example, for the nova_api service, run:
# docker start nova_api
- Check the service logs (refer to Viewing cluster logs).
- If you cannot troubleshoot the problem, contact the technical support team.
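The check-then-start procedure used for the upstream alerts above can be wrapped in one function. This is a sketch under assumptions, not a product utility: the name ensure_container_up is hypothetical, and it queries the container state with docker inspect instead of filtering the docker ps output with grep.

```shell
# Hypothetical helper: start a container only if it exists and is not running.
ensure_container_up() {
    name="$1"
    # docker inspect prints "true" or "false" for .State.Running
    state=$(docker inspect --format '{{.State.Running}}' "$name" 2>/dev/null) || {
        echo "$name: container not found" >&2
        return 2
    }
    if [ "$state" = "true" ]; then
        echo "$name: already running"
    else
        echo "$name: down, starting"
        docker start "$name" >/dev/null
    fi
}
```

For the nova_api example from the steps above, the call would be ensure_container_up nova_api. If the container does not come up, continue with the service logs as described in the remaining steps.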
-
-
OpenStack Cinder Scheduler is down
-
OpenStack Block Storage (Cinder) Scheduler agent is down on host <hostname>.
- Ensure that the cinder_scheduler container is up on the specified node by running:
# docker ps --all | grep cinder_scheduler
- If the container is down, start it by running:
# docker start cinder_scheduler
- Check the service log at /var/log/hci/cinder/cinder-scheduler.log.
- If you cannot troubleshoot the problem, contact the technical support team.
-
-
OpenStack Cinder Volume agent is down
-
OpenStack Block Storage (Cinder) Volume agent is down on host <hostname>.
- Ensure that the cinder_volume container is up on the specified node by running:
# docker ps --all | grep cinder_volume
- If the container is down, start it by running:
# docker start cinder_volume
- Check the service log at /var/log/hci/cinder/cinder-volume.log.
- If you cannot troubleshoot the problem, contact the technical support team.
-
-
OpenStack Neutron L3 agent is down
-
OpenStack Networking (Neutron) L3 agent is down on host <hostname>.
- Ensure that the neutron_l3_agent container is up on the specified node by running:
# docker ps --all | grep neutron_l3_agent
- If the container is down, start it by running:
# docker start neutron_l3_agent
- Check the service log at /var/log/hci/neutron/neutron-l3-agent.log.
- If you cannot troubleshoot the problem, contact the technical support team.
-
-
OpenStack Neutron Open vSwitch agent is down
-
OpenStack Networking (Neutron) Open vSwitch agent is down on host <hostname>.
- Ensure that the neutron_openvswitch_agent container is up on the specified node by running:
# docker ps --all | grep neutron_openvswitch_agent
- If the container is down, start it by running:
# docker start neutron_openvswitch_agent
- Check the service log at /var/log/hci/neutron/neutron-openvswitch-agent.log.
- If you cannot troubleshoot the problem, contact the technical support team.
-
-
OpenStack Neutron Metadata agent is down
-
OpenStack Networking (Neutron) Metadata agent is down on host <hostname>.
- Ensure that the neutron_metadata_agent container is up on the specified node by running:
# docker ps --all | grep neutron_metadata_agent
- If the container is down, start it by running:
# docker start neutron_metadata_agent
- Check the service log at /var/log/hci/neutron/neutron-metadata-agent.log.
- If you cannot troubleshoot the problem, contact the technical support team.
-
-
OpenStack Neutron DHCP agent is down
-
OpenStack Networking (Neutron) DHCP agent is down on host <hostname>.
- Ensure that the neutron_dhcp_agent container is up on the specified node by running:
# docker ps --all | grep neutron_dhcp_agent
- If the container is down, start it by running:
# docker start neutron_dhcp_agent
- Check the service log at /var/log/hci/neutron/neutron-dhcp-agent.log.
- If you cannot troubleshoot the problem, contact the technical support team.
-
-
OpenStack Nova Compute is down
-
OpenStack Compute (Nova) agent is down on host <hostname>.
- Ensure that the nova_compute container is up on the specified node by running:
# docker ps --all | grep nova_compute
- If the container is down, start it by running:
# docker start nova_compute
- Check the service log at /var/log/hci/nova/nova-compute.log.
- If you cannot troubleshoot the problem, contact the technical support team.
-
-
OpenStack Nova Conductor is down
-
OpenStack Compute (Nova) Conductor agent is down on host <hostname>.
- Ensure that the nova_conductor container is up on the specified node by running:
# docker ps --all | grep nova_conductor
- If the container is down, start it by running:
# docker start nova_conductor
- Check the service log at /var/log/hci/nova/nova-conductor.log.
- If you cannot troubleshoot the problem, contact the technical support team.
-
-
OpenStack Nova Scheduler is down
-
OpenStack Compute (Nova) Scheduler agent is down on host <hostname>.
- Ensure that the nova_scheduler container is up on the specified node by running:
# docker ps --all | grep nova_scheduler
- If the container is down, start it by running:
# docker start nova_scheduler
- Check the service log at /var/log/hci/nova/nova-scheduler.log.
- If you cannot troubleshoot the problem, contact the technical support team.
-
-
OpenStack Octavia Provisioning Worker v1 is down
-
OpenStack Loadbalancing (Octavia) provisioning worker version 1 is down on host <hostname>.
- Ensure that the octavia_worker container is up on the management node by running:
# docker ps --all | grep octavia_worker
- If the container is down, start it by running:
# docker start octavia_worker
- Check the service log at /var/log/hci/octavia/octavia-worker.log.
- If you cannot troubleshoot the problem, contact the technical support team.
-
-
OpenStack Octavia Provisioning Worker v2 is down
-
OpenStack Loadbalancing (Octavia) provisioning worker version 2 is down on host <hostname>.
- Ensure that the octavia_worker container is up on the management node by running:
# docker ps --all | grep octavia_worker
- If the container is down, start it by running:
# docker start octavia_worker
- Check the service log at /var/log/hci/octavia/octavia-worker.log.
- If you cannot troubleshoot the problem, contact the technical support team.
-
-
OpenStack Octavia Housekeeping service is down
-
OpenStack Loadbalancing (Octavia) housekeeping service is down on host <hostname>.
- Ensure that the octavia_housekeeping container is up on the management node by running:
# docker ps --all | grep octavia_housekeeping
- If the container is down, start it by running:
# docker start octavia_housekeeping
- Check the service log at /var/log/hci/octavia/octavia-housekeeping.log.
- If you cannot troubleshoot the problem, contact the technical support team.
-
-
OpenStack Octavia HealthManager service is down
-
OpenStack Loadbalancing (Octavia) health manager service is down on host <hostname>.
- Ensure that the octavia_health_manager container is up on the management node by running:
# docker ps --all | grep octavia_health_manager
- If the container is down, start it by running:
# docker start octavia_health_manager
- Check the service log at /var/log/hci/octavia/octavia-health-manager.log.
- If you cannot troubleshoot the problem, contact the technical support team.
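The container names and log paths listed for the agent alerts above follow one pattern: the directory is the part of the container name before the first underscore, and the file name is the container name with underscores replaced by hyphens. The sketch below derives the path from the name. The function name container_log_path is hypothetical, and the pattern is only confirmed for the containers whose log paths are given in this section; for any other service, refer to Viewing cluster logs.

```shell
# Hypothetical helper: derive the log path for a container named in this section.
container_log_path() {
    name="$1"
    project="${name%%_*}"                       # nova_compute -> nova
    file=$(printf '%s' "$name" | tr '_' '-')    # nova_compute -> nova-compute
    printf '/var/log/hci/%s/%s.log\n' "$project" "$file"
}
```

For example, tail -n 50 "$(container_log_path nova_compute)" would show the last lines of the Nova Compute log on the node named in the alert.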
-
-
High request error rate for OpenStack API requests detected
-
Request error rate more than 5% detected for <object_id> for the last 1 hour. Check <object_id> resource usage.
- Check the status of the affected compute services.
- If some services are down, bring them up.
- If you cannot troubleshoot the problem, contact the technical support team.
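The 5% threshold in the last alert is a ratio over the requests of the last hour. The sketch below only illustrates the arithmetic, assuming the rate is the number of failed requests divided by the number of all requests; the alert itself is evaluated by the monitoring system, and the function name and any counts you pass are hypothetical.

```shell
# Hypothetical helper: succeed if errors/total is more than 5%.
error_rate_exceeded() {
    errors="$1"
    total="$2"
    [ "$total" -gt 0 ] || return 1      # no requests, nothing to report
    # Integer form of errors/total > 5/100, avoids floating point
    [ $((errors * 100)) -gt $((total * 5)) ]
}
```

With 60 failed requests out of 1000, the function succeeds (6% is above the threshold); with 50 out of 1000 it does not, because the alert requires more than 5%.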
Compute cluster alerts
-
Compute cluster has failed
-
Compute cluster has failed. Unable to manage virtual machines.
- Go to the Monitoring > Dashboard screen, and then click Grafana dashboard.
- Open the Compute service status dashboard and identify the failed service.
- Depending on the service, follow the instructions from the Compute service alerts section.
-
Cluster is running out of vCPU resources
-
Cluster has reached 80% of the vCPU allocation limit.
The compute cluster may soon run short of vCPU resources and become unable to accommodate new virtual machines. To avoid this, add more compute nodes or return fenced nodes, if any, to operation.
-
Cluster is out of vCPU resources
-
Cluster has reached 95% of the vCPU allocation limit.
The compute cluster will soon run out of vCPU resources and become unable to accommodate new virtual machines. To avoid this, add more compute nodes or return fenced nodes, if any, to operation.
-
Cluster is running out of memory
-
Cluster has reached 80% of the memory allocation limit.
The compute cluster may soon run short of RAM resources and become unable to accommodate new virtual machines. To avoid this, add more compute nodes or return fenced nodes, if any, to operation.
-
Cluster is out of memory
-
Cluster has reached 95% of the memory allocation limit.
The compute cluster will soon run out of RAM resources and become unable to accommodate new virtual machines. To avoid this, add more compute nodes or return fenced nodes, if any, to operation.
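The four vCPU and memory alerts above come down to two thresholds on the allocation percentage: 80% and 95%. As a sketch of that mapping only (the function name is hypothetical, and the labels are shorthand for the alert titles, not product severity names):

```shell
# Hypothetical helper: map an allocation percentage to the alert that would apply.
allocation_alert() {
    percent="$1"
    if [ "$percent" -ge 95 ]; then
        echo "out of resources"
    elif [ "$percent" -ge 80 ]; then
        echo "running out of resources"
    else
        echo "no alert"
    fi
}
```

The comparisons use "greater than or equal" because the alert messages say the cluster "has reached" the limit.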
-
Virtual machine error
-
Virtual machine <name> with ID <id> is in the 'Error' state.
- Examine the VM history in the History tab on the VM right pane and reset the VM state, as described in Troubleshooting virtual machines.
- If you cannot troubleshoot the problem, contact the technical support team.
-
Virtual machine state mismatch
-
State of virtual machine <name> with ID <id> differs in the Nova databases and libvirt configuration.
Do not try to migrate the VM or reset its state. Contact the technical support team.
-
Volume attachment details mismatch
-
Attachment details for volume with ID <id> differ in the Nova and libvirt databases.
Do not try to migrate the VM or reset its state. Contact the technical support team.
-
Virtual network port check failed
-
Neutron port with ID <port_id> failed <check_type> check. The port type is <device_owner> with owner ID <device_id>.
- Run the openstack --insecure port check command, specifying the port ID.
- Check connectivity of the device owner.
- If you cannot troubleshoot the problem, contact the technical support team.
-
Backup plan failed
-
Backup plan <plan_name> for compute volumes has three consecutive failures.
- Check the service logs at /var/log/hci/freezer/freezer-scheduler.log and /var/log/hci/cinder/cinder-backup.log.
- If you cannot troubleshoot the problem, contact the technical support team.
-
Virtual router HA has more than one active L3 agent
-
Virtual router HA with ID <router_id> has more than one active L3 agent. Please contact the technical support.
-
Virtual router HA has no active L3 agent
-
Virtual router HA with ID <router_id> has no active L3 agent. Please contact the technical support.
-
Virtual router SNAT-related port has invalid host binding
-
Virtual router SNAT-related port with ID <id> is bound to the Standby HA router node. Please contact the technical support.
-
Virtual router gateway port has invalid host binding
-
Virtual router gateway port with ID <id> is bound to the Standby HA router node. Please contact the technical support.
-
Neutron bridge mapping not found
-
Physical network "<physical_network>" is not found in the bridge mapping on node "<hostname>". Virtual network "<virtual_network>" on this node is most likely not functioning. Please contact the technical support.
-
Virtual DHCP server is unavailable from node
-
Built-in DHCP server for virtual network "<network_id>" is not available from node "<hostname>". Please check the neutron-dhcp-agent service or contact the technical support.
-
Virtual DHCP server is unavailable
-
Built-in DHCP server for virtual network "<network_id>" is not available from cluster nodes. Please check the neutron-dhcp-agent service or contact the technical support.
-
Virtual DHCP server HA degraded on node
-
Only one built-in DHCP server for virtual network "<network_id>" is reachable from node "<hostname>". DHCP high availability entered the degraded state. Please check the neutron-dhcp-agent service or contact the technical support.
-
Virtual DHCP server HA degraded
-
Only one built-in DHCP server for virtual network "<network_id>" is reachable from cluster nodes. DHCP high availability entered the degraded state. Please check the neutron-dhcp-agent service or contact the technical support.
-
Unrecognized DHCP servers detected from node
-
Built-in DHCP service for virtual network "<network_id>" may be malfunctioning on node "<hostname>". Please ensure that virtual machines are receiving correct DHCP addresses or contact the technical support.
-
Unrecognized DHCP servers detected
-
Built-in DHCP service for virtual network "<network_id>" may be malfunctioning. Please ensure that virtual machines are receiving correct DHCP addresses or contact the technical support.
Compute node alerts
-
Node is running out of vCPU resources
-
Node <node> with ID <id> has reached 80% of the vCPU allocation limit.
The compute node may soon run short of vCPU resources and become unable to accommodate new virtual machines. To avoid this, check the distribution of VMs in the compute cluster, and then migrate VMs from the specified node to less loaded compute nodes.
-
Node is out of vCPU resources
-
Node <node> with ID <id> has reached 95% of the vCPU allocation limit.
The compute node will soon run out of vCPU resources and become unable to accommodate new virtual machines. To avoid this, check the distribution of VMs in the compute cluster, and then migrate VMs from the specified node to less loaded compute nodes.
-
Node is running out of memory
-
Node <node> with ID <id> has reached 80% of the memory allocation limit.
The compute node may soon run short of RAM resources and become unable to accommodate new virtual machines. To avoid this, check the distribution of VMs in the compute cluster, and then migrate VMs from the specified node to less loaded compute nodes.
-
Node is out of memory
-
Node <node> with ID <id> has reached 95% of the memory allocation limit.
The compute node will soon run out of RAM resources and become unable to accommodate new virtual machines. To avoid this, check the distribution of VMs in the compute cluster, and then migrate VMs from the specified node to less loaded compute nodes.
-
Node had a fenced state for 1 hour
-
For the last 2 hours node <node> with ID <id> had a fenced state at least for 1 hour.
Domain quota alerts
-
Domain is out of vCPU resources
-
Domain <name> has reached <value_80<95>% of the vCPU allocation limit.
The domain will soon run out of vCPU resources, and it will not be possible to create new virtual machines. To avoid this, add more vCPUs to the domain quota.
-
Domain is out of vCPU resources
-
Domain <name> has reached <value_>=95>% of the vCPU allocation limit.
The domain will soon run out of vCPU resources, and it will not be possible to create new virtual machines. To avoid this, add more vCPUs to the domain quota.
-
Domain is out of memory
-
Domain <name> has reached <value_80<95>% of the memory allocation limit.
The domain will soon run out of RAM resources, and it will not be possible to create new virtual machines. To avoid this, add more RAM to the domain quota.
-
Domain is out of memory
-
Domain <name> has reached <value_>=95>% of the memory allocation limit.
The domain will soon run out of RAM resources, and it will not be possible to create new virtual machines. To avoid this, add more RAM to the domain quota.
-
Domain is out of storage policy space
-
Domain <name> has reached <value_80<95>% of the <policy_name> storage policy allocation limit.
The domain will soon run out of storage policy space, and it will not be possible to create new compute volumes with this storage policy. To avoid this, add more storage space to the domain quota.
-
Domain is out of storage policy space
-
Domain <name> has reached <value_>=95>% of the <policy_name> storage policy allocation limit.
The domain will soon run out of storage policy space, and it will not be possible to create new compute volumes with this storage policy. To avoid this, add more storage space to the domain quota.
Project quota alerts
-
Project is out of vCPU resources
-
Project <name> has reached 95% of the vCPU allocation limit.
The project will soon run out of vCPU resources, and it will not be possible to create new virtual machines. To avoid this, add more vCPUs to the project quota.
-
Project is out of memory
-
Project <name> has reached 95% of the memory allocation limit.
The project will soon run out of RAM resources, and it will not be possible to create new virtual machines. To avoid this, add more RAM to the project quota.
-
Project is out of floating IP addresses
-
Project <name> has reached 95% of the floating IP address allocation limit.
The project will soon run out of floating IP addresses, and it will not be possible to assign them to virtual machines. To avoid this, add more floating IPs to the project quota.
-
Network is out of IP addresses
-
Network <name> with ID <id> in project <name> has reached 95% of the IP address allocation limit.
The network will soon run out of IP addresses, and it will not be possible to connect new virtual machines to this network. To avoid this, add more allocation pools to the network.
-
Project is out of storage policy space
-
Project <name> has reached 95% of the <policy_name> storage policy allocation limit.
The project will soon run out of storage policy space, and it will not be possible to create new compute volumes with this storage policy. To avoid this, add more storage space to the project quota.
Other alerts
-
Libvirt service is down
-
Libvirt service is down on node <node> with ID <id>. Check the service state and start it. If the service cannot start, contact the technical support.
Start the libvirtd service on the specified node by running:
# systemctl start libvirtd.service
-
Docker service is down
-
Docker service is down on host <hostname>.
Start the Docker service on the specified node by running:
# systemctl start docker.service
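Both the Libvirt and the Docker alert are resolved with systemctl start. The sketch below starts a unit only when it is not already active and reports the outcome. The function name start_if_inactive is hypothetical, and it assumes you run it as root on the node named in the alert.

```shell
# Hypothetical helper: start a systemd unit if it is not already active.
start_if_inactive() {
    unit="$1"
    if systemctl is-active --quiet "$unit"; then
        echo "$unit: already active"
        return 0
    fi
    systemctl start "$unit" || {
        echo "$unit: failed to start, contact the technical support team" >&2
        return 1
    }
    echo "$unit: started"
}
```

For the two alerts above, the calls would be start_if_inactive libvirtd.service and start_if_inactive docker.service.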
-
RabbitMQ node is down
-
One or more nodes in the Rabbitmq cluster is down.
Contact the technical support team.
-
RabbitMQ split brain detected
-
RabbitMQ cluster has experienced a split brain due to a network partition.
Contact the technical support team.
-
PostgreSQL database size is greater than 30 GB
-
PostgreSQL database "<name>" on node "<hostname>" is greater than 30 GB in size. Verify that deleted entries are archived or contact the technical support.
-
PostgreSQL database uses more than 50% of node root partition
-
PostgreSQL databases on node "<hostname>" with ID "<id>" use more than 50% of node root partition. Verify that deleted entries are archived or contact the technical support.
-
Kafka SSL CA certificate will expire in less than 30 days
-
Kafka SSL CA certificate will expire in <number> days. Please renew the certificate.
-
Kafka SSL CA certificate has expired
-
Kafka SSL CA certificate has expired. Please renew the certificate.
-
Kafka SSL client certificate will expire in less than 30 days
-
Kafka SSL client certificate will expire in <number> days. Please renew the certificate.
-
Kafka SSL client certificate has expired
-
Kafka SSL client certificate has expired. Please renew the certificate.
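The Kafka alerts above distinguish a certificate that expires in less than 30 days from one that has already expired. The openssl x509 -checkend option can make the same distinction for a PEM file. This is a sketch under assumptions: the function name cert_expiry_status is hypothetical, and this section does not state where the Kafka certificates are stored, so the path you pass is your own.

```shell
# Hypothetical helper: classify a PEM certificate by the time left until expiry.
cert_expiry_status() {
    cert="$1"
    [ -r "$cert" ] || { echo "cannot read $cert" >&2; return 2; }
    # -checkend N exits 0 if the certificate is still valid N seconds from now
    if ! openssl x509 -checkend 0 -noout -in "$cert" >/dev/null 2>&1; then
        echo "expired"
    elif ! openssl x509 -checkend $((30 * 86400)) -noout -in "$cert" >/dev/null 2>&1; then
        echo "expires in less than 30 days"
    else
        echo "ok"
    fi
}
```

Note that a file that is readable but is not a valid PEM certificate also makes openssl exit with a non-zero status, so the sketch would report it as expired.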