Monitoring the compute cluster

After you create the compute cluster, you can monitor its status and statistics. You can also monitor individual compute nodes, virtual machines, and load balancers.

To view the compute cluster status

Click the cluster name at the bottom of the left menu. The cluster status can be one of the following:

Healthy: All compute cluster components and nodes operate normally.
Configuring: The compute cluster configuration (the default CPU model for VMs or the number of compute nodes) is changing.
Warning: The compute cluster operates normally but some issues have been detected.
Critical: The compute cluster has encountered a critical problem and is not operational.

To view the compute cluster statistics

Admin panel

Go to the Compute > Overview screen, which has the following charts:

The Reserved vCPUs chart displays vCPU reservations in the compute cluster. A vCPU reservation is a guarantee on vCPUs for a service or virtual machine.

The following statistics are available:

Total

The total number of virtual CPUs in the compute cluster. It is a product of the total number of physical CPUs on all compute nodes and the cluster overcommitment ratio.

System

The number of virtual CPUs reserved for the system and storage services on all nodes in the compute cluster. To learn more about CPU reservations for different services, refer to Server requirements.

VMs

The number of virtual CPUs provisioned for all virtual machines in the compute cluster.

Free

The number of free virtual CPUs on all nodes in the compute cluster.

Fenced

The number of virtual CPUs on all fenced nodes in the compute cluster.

Default cluster overcommitment ratio

The ratio of the number of virtual CPUs to the number of physical CPUs.

The parameter is set in /etc/kolla/nova-compute/nova.conf. You can change it by using the command vinfra service compute set --nova-compute-cpu-allocation-ratio <value> (refer to Changing virtual CPU overcommitment).
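As a worked example of the arithmetic behind this chart, the following sketch derives Total from the physical CPU count and the overcommitment ratio, and then estimates Free. The function names are hypothetical. The relation Free = Total - System - VMs is an assumption inferred from the chart legend (with no fenced nodes), not a documented formula. The input values are illustrative and mirror the sample vinfra service compute stat output shown later in this topic.

```python
def total_vcpus(physical_cpus: int, overcommit_ratio: float) -> int:
    """Total = physical CPUs on all compute nodes x overcommitment ratio."""
    return int(physical_cpus * overcommit_ratio)


def free_vcpus(total: int, system: int, vms: int) -> int:
    """Assumed relation: vCPUs not reserved for the system or VMs are free."""
    return total - system - vms


# Illustrative values: 12 physical CPUs, ratio 8, 56 system vCPUs, 2 VM vCPUs.
total = total_vcpus(physical_cpus=12, overcommit_ratio=8)
free = free_vcpus(total, system=56, vms=2)
print(f"Total: {total}, Free: {free}")  # Total: 96, Free: 38
```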

A similar chart is available for each individual node in the compute cluster.

The Reserved RAM chart displays RAM reservations in the compute cluster. A RAM reservation is a guarantee on RAM for a service or virtual machine.

The following statistics are available:

Total

The total amount of RAM on all nodes in the compute cluster. It is a product of the total amount of physical RAM on all compute nodes and the overcommitment ratio.

System

The amount of RAM reserved for the system and storage services on all nodes in the compute cluster. To learn more about RAM reservations for different services, refer to Server requirements.

You can view RAM reservation details for all of the cluster nodes in the vinfra node ram-reservation list output.

VMs

The amount of RAM provisioned for all virtual machines in the compute cluster.

Free

The amount of free RAM on all nodes in the compute cluster.

Fenced

The amount of RAM on all fenced nodes in the compute cluster.

Default cluster overcommitment ratio

The ratio of the maximum amount of RAM that can be reserved to the amount of physical RAM.

The parameter is set in /etc/kolla/nova-compute/nova.conf. You can change it by using the command vinfra service compute set --nova-compute-ram-allocation-ratio <value> (refer to Configuring memory for virtual machines).
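The same arithmetic applies to RAM. The sketch below is a hedged illustration only: the function name is hypothetical, Free is assumed to be what remains of Total after the System and VMs reservations, and the byte values are illustrative (they mirror the physical mem_total, reserved memory, and compute vm_mem_reserved fields of the sample CLI output in this topic, whose mapping to the chart labels is itself an assumption).

```python
GIB = 1024 ** 3  # bytes in one gibibyte


def reserved_ram(physical: int, ratio: float, system: int, vms: int) -> dict:
    """Return the chart values in bytes.

    Total follows the text (physical RAM x overcommitment ratio). Free is an
    assumed remainder after the System and VMs reservations.
    """
    total = int(physical * ratio)
    return {"total": total, "system": system, "vms": vms,
            "free": total - system - vms}


# Illustrative byte values; the overcommitment ratio is 1.0.
stats = reserved_ram(physical=74789638144, ratio=1.0,
                     system=26588925952, vms=1073741824)
for name, value in stats.items():
    print(f"{name:>6}: {value / GIB:.2f} GiB")
```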

A similar chart is available for each individual node in the compute cluster.

The Provisioned storage chart shows usage of storage space by the compute cluster.

The following statistics are available:

Total

The total size of volumes provisioned in the compute cluster.

Used

The amount of storage space actually occupied by data in all volumes provisioned in the compute cluster.

Free

The amount of unused space in all volumes provisioned in the compute cluster.

The VMs status chart shows the total number of virtual machines in the compute cluster and groups them by status.

The VM status can be the following:

Running

The number of virtual machines that are up and running.

In progress

The number of virtual machines that are in a transitional state, such as building, restarting, or migrating.

Stopped

The number of virtual machines that are suspended or powered off.

Error

The number of virtual machines that have failed. You can reset the state for such VMs to their last stable state.

To see a full list of virtual machines filtered by the chosen status, click the number next to the status icon.

The Top VMs chart lists the virtual machines with the highest resource consumption sorted by vCPU, RAM, or Storage in descending order.

To switch between lists, click the desired resource. To see a full list of virtual machines in the compute cluster, click Show all.

The Alerts chart lists all of the alerts related to the compute cluster sorted by severity.

Alerts include the following:

Critical

The compute cluster has encountered a critical problem and is unmanageable. For example, an API service on all of the management nodes has failed or one of the compute agents is down. In this case, contact the technical support team.

Warning

The compute cluster is experiencing a resource shortage or may become unmanageable. For example, an API service on one of the management nodes has failed or some resource has exceeded 95% of its allocation limit.

Info

The compute cluster is experiencing issues that may lead to a resource shortage. For example, some resource has exceeded 80% of its allocation limit.

To see a full list of alerts, click Show all.
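To illustrate how the resource-related examples above map usage to severity, here is a minimal sketch. The function name is hypothetical, and the 80% and 95% thresholds are taken only from the examples in the alert descriptions; the actual alert rules may include other conditions (for instance, service failures), which are not modeled here.

```python
from typing import Optional


def allocation_severity(used: int, limit: int) -> Optional[str]:
    """Classify resource usage against its allocation limit.

    Returns "Warning" above 95% of the limit, "Info" above 80%,
    and None otherwise. Integer math avoids rounding at the boundaries.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    if used * 100 > 95 * limit:
        return "Warning"
    if used * 100 > 80 * limit:
        return "Info"
    return None


print(allocation_severity(used=96, limit=100))  # Warning
print(allocation_severity(used=85, limit=100))  # Info
print(allocation_severity(used=50, limit=100))  # None
```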

Command-line interface

Use the following command:

# vinfra service compute stat
+----------------+----------------------------------------------+
| Field          | Value                                        |
+----------------+----------------------------------------------+
| backup_plans   | count: 1                                     |
|                | scheduled: 1                                 |
| backups        | available: 1                                 |
|                | count: 1                                     |
| compute        | block_capacity: 2147483648                   |
|                | block_usage: 543162368                       |
|                | cpu_allocation_ratio: 8                      |
|                | cpu_usage: 0.07                              |
|                | ram_allocation_ratio: 1.0                    |
|                | vcpus: 2                                     |
|                | vcpus_free: 38                               |
|                | vm_mem_capacity: 48200712192                 |
|                | vm_mem_free: 47126970368                     |
|                | vm_mem_reserved: 1073741824                  |
|                | vm_mem_usage: 201162752                      |
| datetime       | 2025-02-24T15:07:33.576963                   |
| fenced         | physical_cpu_cores: 0                        |
|                | physical_cpu_usage: 0                        |
|                | physical_mem_total: 0                        |
|                | reserved_memory: 0                           |
|                | vcpus: 0                                     |
|                | vm_mem_capacity: 0                           |
| floating_ips   | active: 1                                    |
|                | count: 1                                     |
| images         | active: 1                                    |
|                | count: 1                                     |
| k8s_clusters   | count: 0                                     |
| load_balancers | count: 0                                     |
| networks       | active: 6                                    |
|                | count: 6                                     |
| physical       | block_capacity: 807980261376                 |
|                | block_free: 804296740864                     |
|                | cpu_cores: 12                                |
|                | cpu_usage: 10.99                             |
|                | mem_total: 74789638144                       |
|                | vcpus_total: 96                              |
| ports          | active: 19                                   |
|                | count: 20                                    |
|                | n/a: 1                                       |
| reserved       | cpus: 7                                      |
|                | memory: 26588925952                          |
|                | vcpus: 56                                    |
| routers        | active: 3                                    |
|                | count: 3                                     |
| servers        | active: 1                                    |
|                | count: 2                                     |
|                | error: 0                                     |
|                | in_progress: 0                               |
|                | running: 1                                   |
|                | shutoff: 1                                   |
|                | stopped: 1                                   |
|                | top:                                         |
|                |   disk:                                      |
|                |   - id: 6347a196-62aa-4f20-8b48-435e2c2a5bb9 |
|                |     name: cirros                             |
|                |     size: 274726912                          |
|                |   - id: 784bfe4c-5bae-4811-ad59-c52d5d62c66b |
|                |     name: test                               |
|                |     size: 268435456                          |
|                |   memory:                                    |
|                |   - id: 784bfe4c-5bae-4811-ad59-c52d5d62c66b |
|                |     name: test                               |
|                |     size: 201162752                          |
|                |   - id: 6347a196-62aa-4f20-8b48-435e2c2a5bb9 |
|                |     name: cirros                             |
|                |     size: 0                                  |
|                |   vcpus:                                     |
|                |   - count: 0.01                              |
|                |     id: 784bfe4c-5bae-4811-ad59-c52d5d62c66b |
|                |     name: test                               |
|                |   - count: 0                                 |
|                |     id: 6347a196-62aa-4f20-8b48-435e2c2a5bb9 |
|                |     name: cirros                             |
| snapshots      | available: 1                                 |
|                | count: 1                                     |
| stacks         | count: 0                                     |
| volumes        | available: 2                                 |
|                | count: 4                                     |
|                | in-use: 2                                    |
| vpns           | count: 0                                     |
+----------------+----------------------------------------------+
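
For scripted monitoring, the statistics can be read programmatically. The following is a sketch under stated assumptions: it assumes that vinfra accepts the -f json output formatter and that the JSON output has the same nested field names as the table above. The helper names and the mapping of fields to the Reserved vCPUs chart labels are hypothetical and should be verified on your system.

```python
import json
import subprocess


def load_compute_stat() -> dict:
    """Run the CLI and parse its output (assumes `-f json` is supported)."""
    result = subprocess.run(
        ["vinfra", "service", "compute", "stat", "-f", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


def summarize_vcpus(stat: dict) -> dict:
    """Pick the fields assumed to correspond to the Reserved vCPUs chart."""
    return {
        "total": stat["physical"]["vcpus_total"],
        "system": stat["reserved"]["vcpus"],
        "vms": stat["compute"]["vcpus"],
        "free": stat["compute"]["vcpus_free"],
        "fenced": stat["fenced"]["vcpus"],
    }


# Excerpt of the sample output above, written in the assumed JSON structure.
SAMPLE_JSON = """
{
  "compute": {"cpu_allocation_ratio": 8, "vcpus": 2, "vcpus_free": 38},
  "fenced": {"vcpus": 0},
  "physical": {"cpu_cores": 12, "vcpus_total": 96},
  "reserved": {"cpus": 7, "vcpus": 56}
}
"""

summary = summarize_vcpus(json.loads(SAMPLE_JSON))
print(summary)
```

On a node where vinfra is available, summarize_vcpus(load_compute_stat()) would produce the same summary from live data, provided the assumptions above hold.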

To view more details about the compute cluster

Go to the Monitoring > Dashboard screen, and then click Grafana dashboard. A separate browser tab will open with preconfigured Grafana dashboards.

The Compute service status dashboard shows the status of the compute services and agents on all of the compute nodes. You can sort the displayed services per hostname, service name, and service status.

On the Compute resource details dashboard, you can monitor all existing virtual objects in the compute cluster by status over time.

For the detailed monitoring of the compute resource allocation, use the Compute resource allocation dashboard. The charts on this dashboard show resource quotas, allocation usage, and ratio over time. You can view the statistics for all domains and projects, or filter the data per specific domain or project.

The Compute vCPU/RAM allocation and overcommitment ratio dashboard helps identify discrepancies between the expected and actual resource usage across compute nodes by displaying vCPU and RAM allocation dynamics per node reported by the Placement service and the hypervisor. It also shows the total amount of resources reserved for system services, as well as the percentage of vCPUs and RAM used by or allocated to VMs relative to the total available resources, excluding system reservations.
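
The percentage described above can be sketched as follows. This is an assumed formula based on the wording "relative to the total available resources, excluding system reservations"; the function name is hypothetical, and the dashboard's exact calculation may differ.

```python
def vm_share_percent(vm_amount: float, total: float, system_reserved: float) -> float:
    """Percentage of a resource used by or allocated to VMs.

    The denominator is the total amount minus system reservations.
    """
    available = total - system_reserved
    if available <= 0:
        raise ValueError("no resources left after system reservations")
    return 100.0 * vm_amount / available


# Illustrative values: 2 vCPUs allocated to VMs, 96 total, 56 reserved for the system.
print(vm_share_percent(vm_amount=2, total=96, system_reserved=56))  # 5.0
```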

To monitor the compute API requests, use the Compute service API details dashboard. The charts on this dashboard show the rate of successful and failed requests, as well as the 95th and 99th percentiles of response time, over 10-minute intervals. You can filter the displayed requests per compute service. The most important charts here are those of the failed request rate and response time. If you see spikes on them, check the status of the corresponding services.
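
To clarify what a response time percentile means, the sketch below computes the 95th and 99th percentiles with the nearest-rank method. It is only an illustration with a hypothetical function name and made-up sample data; the dashboard may use a different estimation method.

```python
def percentile(samples: list, p: int) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range 1..100")
    ordered = sorted(samples)
    rank = (p * len(ordered) + 99) // 100  # integer ceiling of p * n / 100
    return ordered[rank - 1]


# Made-up response times in milliseconds: 1, 2, ..., 100.
response_times_ms = list(range(1, 101))
print(percentile(response_times_ms, 95))  # 95
print(percentile(response_times_ms, 99))  # 99
```

A 95th percentile of 95 ms here means that 95% of the requests completed in 95 ms or less, so a spike in this value points to a slow tail rather than a change in the average.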

The Compute RPC dashboard displays details on Remote Procedure Call (RPC) requests across the compute services.

The RabbitMQ nodes, RabbitMQ messages, and RabbitMQ clients dashboards are intended for troubleshooting the RabbitMQ cluster by the support team.

The PostgreSQL overview dashboard shows information about the PostgreSQL database size and replication status, as well as other database details.

To see a detailed description for each chart, click the i icon in its left corner.