10.23. Monitoring Nodes and Virtual Environments via Prometheus
Note: The collected statistics are only intended for monitoring and are not suitable for billing purposes.
You can monitor nodes running Virtuozzo Hybrid Server 7.5 and newer as well as VEs hosted on them via Prometheus. A typical list of required components includes:
Prometheus, a service that collects statistics from exporters and stores them in a time series database.
Alertmanager, a service that receives alerts from Prometheus and handles their delivery via various communication channels.
Grafana, a service that provides a web panel with flexible dashboards and supports Prometheus as a data source.
Exporters, services that are installed on Virtuozzo Hybrid Server nodes and export metrics via a simple HTTP server.
This guide describes how to install the exporters, configure an existing Prometheus service, and import the dashboards to an existing Grafana panel. For details on installing Prometheus, Alertmanager, and Grafana, see their respective documentation.
10.23.1. Installing the Exporters
Perform these steps on each Virtuozzo Hybrid Server 7.5 node that you want to monitor.
Install exporter packages:
# yum install node_exporter libvirt_exporter
Configure the firewall:
# firewall-cmd --permanent --zone=public --add-rich-rule='rule family="ipv4" source address="<prom_IP>/32" port protocol="tcp" port="9177" accept'
# firewall-cmd --permanent --zone=public --add-rich-rule='rule family="ipv4" source address="<prom_IP>/32" port protocol="tcp" port="9100" accept'
# firewall-cmd --reload
Where <prom_IP> is the Prometheus IP address, port 9177 is used by the libvirt exporter, and port 9100 is used by the node exporter.
It is recommended to expose the metrics only to the Prometheus server. Unrestricted access to the metrics can be a security and stability risk.
To be able to monitor Virtuozzo Storage clients, open another port. For example:
# firewall-cmd --permanent --zone=public --add-rich-rule='rule family="ipv4" source address="<prom_IP>/32" port protocol="tcp" port="9999" accept'
# firewall-cmd --reload
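The firewall rules above differ only in the port number, so they can be generated in a loop. The following is a minimal sketch, not part of the product: rich_rule and open_exporter_ports are hypothetical helper names, and the address 192.0.2.10 used in the usage note is a placeholder for <prom_IP>.

```shell
# Build a firewalld rich rule that lets one source address reach one TCP port.
# rich_rule is a hypothetical helper name, not a firewalld command.
rich_rule() {
    # $1 = Prometheus server IP address, $2 = exporter port
    printf 'rule family="ipv4" source address="%s/32" port protocol="tcp" port="%s" accept' "$1" "$2"
}

# Add one permanent rule per port, then reload the firewall once.
# Must be run as root on the node being monitored.
open_exporter_ports() {
    prom_ip=$1
    shift
    for port in "$@"; do
        firewall-cmd --permanent --zone=public \
            --add-rich-rule="$(rich_rule "$prom_ip" "$port")" || return 1
    done
    firewall-cmd --reload
}
```

For example, `open_exporter_ports 192.0.2.10 9100 9177 9999` would open the node exporter, libvirt exporter, and Virtuozzo Storage client ports in one pass.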
Launch the exporters:
# systemctl start node_exporter
# systemctl start libvirt-exporter
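Once the exporters are running, it can be useful to confirm from the Prometheus server that they respond. The sketch below assumes that both exporters serve metrics on the default /metrics path, which the guide does not state explicitly. The helper names metrics_url and check_exporter are hypothetical.

```shell
# Build the URL on which an exporter is assumed to serve its metrics.
metrics_url() {
    # $1 = host name or IP address, $2 = port
    printf 'http://%s:%s/metrics' "$1" "$2"
}

# Succeed if the exporter answers with HTTP 200 within 5 seconds.
check_exporter() {
    code=$(curl -s -o /dev/null -m 5 -w '%{http_code}' "$(metrics_url "$1" "$2")")
    [ "$code" = "200" ]
}
```

For example, `check_exporter hn01.cluster1.tld 9100 && echo reachable` tests the node exporter on one node. A failure may also mean that the firewall rule does not match the address the request comes from.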
After setting up the exporters, on any Virtuozzo Hybrid Server 7.5 node, obtain the sample configuration, rules, and alerts for Prometheus and dashboards for Grafana:
# yum install vz-prometheus-cfg
The files will be placed in /usr/share/vz-prometheus-cfg/. For example:
# tree /usr/share/vz-prometheus-cfg/
/usr/share/vz-prometheus-cfg/
├── alerts
│   ├── vstorage-alerts.yml
│   └── vz-alerts.yml
├── dashboards
│   ├── grafana_hn_dashboard.json
│   ├── grafana_ve_dashboard.json
│   ├── grafana_win_ct_hn_dashboard.json
│   └── grafana_win_ct_ve_dashboard.json
├── prometheus-example.yml
├── rules
│   ├── vstorage-rules.yml
│   ├── vz-rules.yml
│   └── win_ct-rules.yml
└── targets
    ├── targets-example.yml
    └── vstorage-targets-example.yml
10.23.2. Configuring Prometheus
You will need to configure Prometheus so it can start collecting metrics from Virtuozzo Hybrid Server nodes. To do this, modify prometheus.yml based on the sample prometheus-example.yml shipped with vz-prometheus-cfg.
Copy the rule and alert files to the Prometheus server and set their paths in rule_files in prometheus.yml (see the example below).
Create target files that contain information about the exporters you want to scrape. By using multiple target files, you can group nodes by attributes such as datacenter or cluster. The following examples create a server group cluster1 populated with five nodes and scrape their libvirt and node exporters, respectively:
# cat cluster1-libvirt.yml
- labels:
    group: cluster1
  targets:
    - hn01.cluster1.tld:9177
    - hn02.cluster1.tld:9177
    - hn03.cluster1.tld:9177
    - hn04.cluster1.tld:9177
    - hn05.cluster1.tld:9177
# cat cluster1-nodes.yml
- labels:
    group: cluster1
  targets:
    - hn01.cluster1.tld:9100
    - hn02.cluster1.tld:9100
    - hn03.cluster1.tld:9100
    - hn04.cluster1.tld:9100
    - hn05.cluster1.tld:9100
If these nodes are in the Virtuozzo Storage cluster, create a dedicated target file to be able to monitor the Virtuozzo Storage clients as well. For example:
# cat cluster1.yml
- labels:
    group: cluster1
  targets:
    - hn01.cluster1.tld:9999
    - hn02.cluster1.tld:9999
    - hn03.cluster1.tld:9999
    - hn04.cluster1.tld:9999
    - hn05.cluster1.tld:9999
Set paths to target files in scrape_configs in prometheus.yml (see the example below).
The Virtuozzo Storage job must be named fused.
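The target files described above follow the same pattern for every exporter: one group label and a list of host:port pairs. As a rough sketch, they could be generated with a small helper. gen_targets is a hypothetical name, and the output layout simply mirrors the sample target files in this section.

```shell
# Print a Prometheus file-based service discovery target file for one group.
# gen_targets is a hypothetical helper, not shipped with vz-prometheus-cfg.
gen_targets() {
    # $1 = group label, $2 = exporter port, remaining arguments = host names
    group=$1
    port=$2
    shift 2
    printf '%s\n' '- labels:' "    group: $group" '  targets:'
    for host in "$@"; do
        printf '    - %s:%s\n' "$host" "$port"
    done
}
```

For example, `gen_targets cluster1 9177 hn01.cluster1.tld hn02.cluster1.tld > cluster1-libvirt.yml` writes a libvirt exporter target file for two nodes. Calling it again with port 9100 or 9999 produces the node exporter and Virtuozzo Storage variants.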
A complete Prometheus configuration file may look like this:
# cat prometheus.yml
global:
  scrape_interval: 1m
  evaluation_interval: 1m
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - localhost:9093
rule_files:
  - /prometheus-<version>linux-amd64/rules/vz-rules.yml
  - /prometheus-<version>linux-amd64/rules/vstorage-rules.yml
  - /prometheus-<version>linux-amd64/alerts/vz-alerts.yml
  - /prometheus-<version>linux-amd64/alerts/vstorage-alerts.yml
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
    - targets: ['localhost:9090']
  - job_name: libvirt
    relabel_configs:
    - source_labels: [__address__]
      target_label: instance
      regex: (.*)[:].+
    file_sd_configs:
    - files:
      - /prometheus-<version>linux-amd64/targets/cluster1-libvirt.yml
  - job_name: node
    relabel_configs:
    - source_labels: [__address__]
      target_label: instance
      regex: (.*)[:].+
    file_sd_configs:
    - files:
      - /prometheus-<version>linux-amd64/targets/cluster1-nodes.yml
  - job_name: fused
    relabel_configs:
    - source_labels: [__address__]
      target_label: instance
      regex: (.*)[:].+
    file_sd_configs:
    - files:
      - /prometheus-<version>linux-amd64/targets/cluster1.yml
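The relabel_configs entries in the configuration above rewrite the instance label so that it holds the host name without the port. Prometheus anchors the regex at both ends and, for the default replace action, substitutes the first capture group. The sketch below only imitates that behavior with sed to show the effect. strip_port is a hypothetical helper and is not how Prometheus itself evaluates the rule.

```shell
# Imitate the relabel rule: regex (.*)[:].+ anchored at both ends,
# replaced by the first capture group.
strip_port() {
    # $1 = scrape address in host:port form
    printf '%s\n' "$1" | sed -E 's/^(.*)[:].+$/\1/'
}
```

With this rule, hn01.cluster1.tld:9177 and hn01.cluster1.tld:9100 both map to hn01.cluster1.tld, so metrics from different exporters on the same node share one instance value. An address that does not match the pattern is left unchanged.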
After editing the Prometheus configuration file, restart the prometheus and alertmanager services.
To enable monitoring of Virtuozzo Storage clients listed in the target file:
Adjust /etc/fstab on each node, using the previously chosen port. For example:
# cat /etc/fstab | grep ^vstorage
vstorage://cluster1 /vstorage/cluster1 fuse.vstorage defaults,_netdev,prometheus=0.0.0.0:9999 0 0
Stop all virtual environments and re-mount the Virtuozzo Storage file system on each node, one after another, so only one node is down at any given moment:
# umount /vstorage/cluster1
# mount /vstorage/cluster1
Then start all virtual environments once again.
Alternatively, reboot each node one after another, so only one node is down at any given moment. Virtual environments are set to start on node reboots by default.
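Before re-mounting or rebooting, it may help to confirm that each Virtuozzo Storage entry in fstab carries the prometheus option with the intended port. The following is a minimal sketch. vstorage_prom_opts is a hypothetical helper that only reads the file and changes nothing.

```shell
# Print the mount point and prometheus=<address>:<port> option of every
# vstorage entry in an fstab-style file.
vstorage_prom_opts() {
    # $1 = path to the fstab file (defaults to /etc/fstab)
    awk '$1 ~ /^vstorage:/ {
        n = split($4, opts, ",")
        for (i = 1; i <= n; i++)
            if (opts[i] ~ /^prometheus=/) print $2, opts[i]
    }' "${1:-/etc/fstab}"
}
```

Running `vstorage_prom_opts` with no arguments inspects /etc/fstab on the current node. A vstorage entry that prints nothing has no prometheus option yet, and the port shown should match the one opened in the firewall and listed in the target file.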
10.23.3. Configuring Grafana
To see the data collected from the nodes in Grafana, do the following in the Grafana web panel:
If you have not already done so, add the configured Prometheus server as a data source:
Navigate to Configuration -> Data Sources.
Click Add data source and select Prometheus.
Enter a name for the data source.
Specify the Prometheus IP address and port.
Click Save & Test.
Import Virtuozzo Hybrid Server dashboards. Perform these steps for each JSON file shipped with vz-prometheus-cfg.
Navigate to Dashboards -> Manage.
Click Import and Upload JSON file. Select a JSON file with a Grafana dashboard.
Select the previously configured Prometheus data source.
Click Import.
The Virtuozzo Hybrid Server dashboards are now available in Dashboards Home.
10.23.4. Supported Alerts
The following alerts are supported for Virtuozzo Hybrid Server.
Alert | Severity | Description | What to do |
---|---|---|---|
nodeTcpListenDrops | Error | A large number of TCP packets have been dropped on the node. | Inspect the network traffic for issues and fix them. |
nodeTcpRetransSegs | Error | A large number of TCP packets have been retransmitted on the node. | Inspect the network traffic for issues and fix them. |
nodeOutOfMemory | Error | The node has run out of memory. | Find out what has been consuming memory on the node and fix it. If the node is overloaded due to overselling, migrate some of its virtual environments to other nodes. |
nodeOutOfSwap | Critical | The node has run out of swap memory. | Find out what has been consuming memory on the node and fix it. If the node is overloaded due to overselling, migrate some of its virtual environments to other nodes. |
nodeHighMemoryAllocationLatency | Warning | The node’s memory allocation latency is too high. The node may be overloaded, resulting in unpredictable delays in operations. | Find out what has been consuming memory on the node and fix it. If the node is overloaded due to overselling, migrate some of its virtual environments to other nodes. |
nodeRxChecksummingDisabled | Error | The rx-checksumming feature is disabled on the network interface. | These features are enabled by default. If they have been manually disabled, re-enable them for the network interface. |
nodeTxChecksummingDisabled | Error | The tx-checksumming feature is disabled on the network interface. | These features are enabled by default. If they have been manually disabled, re-enable them for the network interface. |
nodeScatterGatherDisabled | Error | The scatter-gather feature is disabled on the network interface. | These features are enabled by default. If they have been manually disabled, re-enable them for the network interface. |
nodeTCPSegmentationOffloadDisabled | Error | The tcp-segmentation-offload feature is disabled on the network interface. | These features are enabled by default. If they have been manually disabled, re-enable them for the network interface. |
nodeGenericSegmentationOffloadDisabled | Error | The generic-segmentation-offload feature is disabled on the network interface. | These features are enabled by default. If they have been manually disabled, re-enable them for the network interface. |
nodeVzLicenseInactive | Critical | The node’s Virtuozzo license is inactive. | Check and update the license. |
guestPausedEIO | Critical | A virtual machine has been paused due to a disk I/O error. | Make sure that the node’s partition where the VM’s disks are stored has not run out of space. If it has, free up more space for the VM. If there is enough free space, evacuate the virtual machine (and any other critical data) from the physical disk it is stored on. Replace the physical disk as it is about to fail. |
guestOsCrashed | Critical | A virtual machine has crashed after a BSOD or kernel panic in the guest OS. Collected for VMs only. | Find out the reasons for the crash, fix the VM, and restart any services that will not do so automatically. |
nodeSMARTDiskError | Critical | A S.M.A.R.T. counter for a node’s disk is below the threshold. | The node’s disk is about to fail. Replace it as soon as possible. |
nodeSMARTDiskWarning | Warning | A S.M.A.R.T. counter for a node’s disk is greater than zero. | Inspect the health of the node’s disk. You may need to replace it soon. |
highCPUusage | Warning | A virtual environment’s CPU usage has been over 90% for the last 10 minutes. The alert is disabled by default. | Check the virtual environment for potential problems, including software issues or malware. These alerts are disabled by default. To enable any of them, uncomment them. |
highMemUsage | Warning | A virtual environment’s memory usage has been over 95% for the last 10 minutes. The alert is disabled by default. | Check the virtual environment for potential problems, including software issues or malware. These alerts are disabled by default. To enable any of them, uncomment them. |
cpuUsageIncrease | Warning | A virtual environment’s CPU usage has greatly increased compared to the previous week. The alert is disabled by default. | Check the virtual environment for potential problems, including software issues or malware. These alerts are disabled by default. To enable any of them, uncomment them. |
memUsageIncrease | Warning | A virtual environment’s memory usage has greatly increased compared to the previous week. The alert is disabled by default. | Check the virtual environment for potential problems, including software issues or malware. These alerts are disabled by default. To enable any of them, uncomment them. |
nodeHighDiskWriteLatency | Warning | Write operation latency for a node’s disk has been too high for the last 10 seconds. | Find and fix the reason why the I/O requests have been taking so long. Possible reasons include high I/O load or deterioration of the disk’s health. |
nodeHighDiskReadLatency | Warning | Read operation latency for a node’s disk has been too high for the last 10 seconds. | Find and fix the reason why the I/O requests have been taking so long. Possible reasons include high I/O load or deterioration of the disk’s health. |
pendingKernelReboot | Warning | The node has been updated but not rebooted to the latest kernel. | Reboot the node and switch to the latest kernel. |
lowPageCache | Warning | The node has a high load average and a very small page cache. The node is overloaded, possibly due to memory overcommitment. | Find out why the node is overloaded. If the node is overloaded due to overselling, migrate some of its virtual environments to other nodes. |
highPfcacheUsage | Warning | The node’s pfcache disk is 90% full and may run out of space. | Add more space to the pfcache disk or clean it up. For details, see the Knowledge Base. |
highVstorageMountLatency | Warning | The latency of vstorage-mount requests on a node has been too high for the last 10 seconds. | Check the health of Virtuozzo Storage and its components. Fix any issues found. |
slowIoRequest | Info | A virtual machine’s I/O requests have been taking longer than 10 seconds. | Inspect the storage, be it local disks or Virtuozzo Storage, for potential issues and fix them. |
unresponsiveBalloonDriver | Info | A virtual machine’s balloon driver is not responding. The VM has stopped reporting its memory usage statistics and is not releasing the node’s memory automatically. | Find out what has happened to the virtio_balloon kernel module inside the VM. Reload the module. |
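Because some of the alerts listed above are disabled by default and are enabled by uncommenting them, it can be handy to see which alerts a given file currently defines. The sketch below assumes that the shipped files use the standard Prometheus rule syntax, where each alert starts with a "- alert: <name>" line, and that disabled alerts have that line prefixed with "#". Neither detail is confirmed by this guide, and the helper names are hypothetical.

```shell
# Print the names of alerts that are active in a Prometheus rule file.
enabled_alerts() {
    # $1 = path to an alert file, for example a copy of vz-alerts.yml
    sed -n -E 's/^[[:space:]]*-[[:space:]]*alert:[[:space:]]*//p' "$1"
}

# Print the names of alerts whose "- alert:" line is commented out.
disabled_alerts() {
    sed -n -E 's/^[[:space:]]*#[[:space:]]*-[[:space:]]*alert:[[:space:]]*//p' "$1"
}
```

For example, `disabled_alerts /usr/share/vz-prometheus-cfg/alerts/vz-alerts.yml` would list the candidates for uncommenting, provided the file follows the assumed layout.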