10.22. Monitoring Nodes and Virtual Environments via Prometheus

Note

The collected statistics are only intended for monitoring and are not suitable for billing purposes.

You can monitor nodes running Virtuozzo Hybrid Server 7.5 and newer, as well as the virtual environments (VEs) hosted on them, via Prometheus. A typical list of required components includes:

  • Prometheus, a service that scrapes statistics from exporters and stores them in a time series database.

  • Alertmanager, a service that receives alerts from Prometheus and handles their delivery via various communication channels.

  • Grafana, a service that provides a web panel with flexible dashboards and supports Prometheus as a data source.

  • Exporters, services that are installed on Virtuozzo Hybrid Server nodes and export metrics via a simple HTTP server.

This guide describes how to install the exporters, configure an existing Prometheus service, and import the dashboards to an existing Grafana panel. For details on installing Prometheus, Alertmanager, and Grafana, see their respective official documentation.

10.22.1. Installing the Exporters

Perform these steps on each Virtuozzo Hybrid Server 7.5 node that you want to monitor.

  1. Install exporter packages:

    # yum install node_exporter libvirt_exporter
    
  2. Configure the firewall:

    # firewall-cmd --permanent --zone=public --add-rich-rule='\
    rule family="ipv4" \
    source address="<prom_IP>/32" \
    port protocol="tcp" port="9177" accept'
    # firewall-cmd --permanent --zone=public --add-rich-rule='\
    rule family="ipv4" \
    source address="<prom_IP>/32" \
    port protocol="tcp" port="9100" accept'
    # firewall-cmd --reload
    

    Where <prom_IP> is the Prometheus IP address, port 9177 is used by the libvirt exporter, and port 9100 is used by the node exporter.

    It is recommended to expose the metrics only to the Prometheus server. Unrestricted access to the metrics can be a security and stability risk.

    To be able to monitor Virtuozzo Storage clients, open another port. For example:

    # firewall-cmd --permanent --zone=public --add-rich-rule='\
    rule family="ipv4" \
    source address="<prom_IP>/32" \
    port protocol="tcp" port="9999" accept'
    # firewall-cmd --reload
    
  3. Launch the exporters:

    # systemctl start node_exporter
    # systemctl start libvirt-exporter
    

After setting up the exporters, on any Virtuozzo Hybrid Server 7.5 node, obtain the sample configuration, rules, and alerts for Prometheus and dashboards for Grafana:

# yum install vz-prometheus-cfg

The files will be placed in /usr/share/vz-prometheus-cfg/. For example:

# tree /usr/share/vz-prometheus-cfg/
/usr/share/vz-prometheus-cfg/
├── alerts
│   ├── vstorage-alerts.yml
│   └── vz-alerts.yml
├── dashboards
│   ├── grafana_hn_dashboard.json
│   ├── grafana_ve_dashboard.json
│   ├── grafana_win_ct_hn_dashboard.json
│   └── grafana_win_ct_ve_dashboard.json
├── prometheus-example.yml
├── rules
│   ├── vstorage-rules.yml
│   ├── vz-rules.yml
│   └── win_ct-rules.yml
└── targets
    ├── targets-example.yml
    └── vstorage-targets-example.yml

10.22.2. Configuring Prometheus

You will need to configure Prometheus so it can start collecting metrics from Virtuozzo Hybrid Server nodes. To do this, modify prometheus.yml based on the sample prometheus-example.yml shipped with vz-prometheus-cfg.

  1. Copy the rule and alert files to the Prometheus server and set their paths in rule_files in prometheus.yml (see the example below).

  2. Create target files that contain information about the exporters you want to scrape. By using multiple target files, you can group nodes by attributes such as datacenter or cluster. The following examples create a server group cluster1 populated with five nodes and scrape their libvirt and node exporters, respectively:

    # cat cluster1-libvirt.yml
    - labels:
        group: cluster1
      targets:
      - hn01.cluster1.tld:9177
      - hn02.cluster1.tld:9177
      - hn03.cluster1.tld:9177
      - hn04.cluster1.tld:9177
      - hn05.cluster1.tld:9177
    
    # cat cluster1-nodes.yml
    - labels:
        group: cluster1
      targets:
      - hn01.cluster1.tld:9100
      - hn02.cluster1.tld:9100
      - hn03.cluster1.tld:9100
      - hn04.cluster1.tld:9100
      - hn05.cluster1.tld:9100
    

    If these nodes are in the Virtuozzo Storage cluster, create a dedicated target file to be able to monitor the Virtuozzo Storage clients as well. For example:

    # cat cluster1.yml
    - labels:
        group: cluster1
      targets:
      - hn01.cluster1.tld:9999
      - hn02.cluster1.tld:9999
      - hn03.cluster1.tld:9999
      - hn04.cluster1.tld:9999
      - hn05.cluster1.tld:9999
    
  3. Set paths to target files in scrape_configs in prometheus.yml (see the example below).

    The Virtuozzo Storage job must be named fused.

A complete Prometheus configuration file may look like this:

# cat prometheus.yml
global:
  scrape_interval:     1m
  evaluation_interval: 1m
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - localhost:9093
rule_files:
  - /prometheus-<version>linux-amd64/rules/vz-rules.yml
  - /prometheus-<version>linux-amd64/rules/vstorage-rules.yml
  - /prometheus-<version>linux-amd64/alerts/vz-alerts.yml
  - /prometheus-<version>linux-amd64/alerts/vstorage-alerts.yml
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
    - targets: ['localhost:9090']
  - job_name: libvirt
    relabel_configs:
    - source_labels: [__address__]
      target_label: instance
      regex: (.*)[:].+
    file_sd_configs:
      - files:
        - /prometheus-<version>linux-amd64/targets/cluster1-libvirt.yml
  - job_name: node
    relabel_configs:
    - source_labels: [__address__]
      target_label: instance
      regex: (.*)[:].+
    file_sd_configs:
      - files:
        - /prometheus-<version>linux-amd64/targets/cluster1-nodes.yml
  - job_name: fused
    relabel_configs:
    - source_labels: [__address__]
      target_label: instance
      regex: (.*)[:].+
    file_sd_configs:
      - files:
        - /prometheus-<version>linux-amd64/targets/cluster1.yml

After editing the Prometheus configuration file, restart the prometheus and alertmanager services.

To enable monitoring of Virtuozzo Storage clients listed in the target file:

  1. Adjust fstab on each node, using the previously chosen port. For example:

    # cat /etc/fstab | grep ^vstorage
    vstorage://cluster1 /vstorage/cluster1 fuse.vstorage defaults,_netdev,prometheus=0.0.0.0:9999  0 0

  2. Stop all virtual environments and re-mount the Virtuozzo Storage file system on each node, one after another, so that only one node is down at any given moment.

    # umount /vstorage/cluster1
    # mount /vstorage/cluster1
    

    Then start all virtual environments once again.

    Alternatively, reboot each node one after another, so only one node is down at any given moment. Virtual environments are set to start on node reboots by default.
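The fstab change in step 1 amounts to appending a prometheus=<address>:<port> option to the mount options of the existing vstorage entry. A hedged sketch of that edit, run here against a sample file rather than the live /etc/fstab (the file name fstab.sample is an assumption for illustration):

```shell
# Work on a sample copy; never edit the live /etc/fstab without a backup.
cat > fstab.sample <<'EOF'
vstorage://cluster1 /vstorage/cluster1 fuse.vstorage defaults,_netdev 0 0
EOF
# Append the prometheus option to the vstorage entry's mount options.
sed -i 's|\(fuse\.vstorage defaults,_netdev\)|\1,prometheus=0.0.0.0:9999|' fstab.sample
grep ^vstorage fstab.sample
```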

10.22.3. Configuring Grafana

To see the data collected from the nodes in Grafana, do the following in the Grafana web panel:

  1. If you have not already done so, add the configured Prometheus server as a data source:

    1. Navigate to Configuration -> Data Sources.

    2. Click Add data source and select Prometheus.

    3. Enter a name for the data source.

    4. Specify the Prometheus IP address and port.

    5. Click Save & Test.

  2. Import Virtuozzo Hybrid Server dashboards. Perform these steps for each JSON file shipped with vz-prometheus-cfg.

    1. Navigate to Dashboards -> Manage.

    2. Click Import and Upload JSON file. Select a JSON file with a Grafana dashboard.

    3. Select the previously configured Prometheus data source.

    4. Click Import.

The Virtuozzo Hybrid Server dashboards are now available in Dashboards Home.

10.22.4. Supported Alerts

The following alerts are supported for Virtuozzo Hybrid Server.

  • nodeTcpListenDrops (Error). A large number of TCP packets have been dropped on the node. What to do: Inspect the network traffic for issues and fix them.

  • nodeTcpRetransSegs (Error). A large number of TCP packets have been retransmitted on the node. What to do: Inspect the network traffic for issues and fix them.

  • nodeOutOfMemory (Error). The node has run out of memory. What to do: Find out what has been consuming memory on the node and fix it. If the node is overloaded due to overselling, migrate some of its virtual environments to other nodes.

  • nodeOutOfSwap (Critical). The node has run out of swap memory. What to do: Find out what has been consuming memory on the node and fix it. If the node is overloaded due to overselling, migrate some of its virtual environments to other nodes.

  • nodeHighMemoryAllocationLatency (Warning). The node's memory allocation latency is too high. The node may be overloaded, resulting in unpredictable delays in operations. What to do: Find out what has been consuming memory on the node and fix it. If the node is overloaded due to overselling, migrate some of its virtual environments to other nodes.

  • nodeRxChecksummingDisabled (Error). The rx-checksumming feature is disabled on the network interface. What to do: The feature is enabled by default. If it has been manually disabled, re-enable it for the network interface.

  • nodeTxChecksummingDisabled (Error). The tx-checksumming feature is disabled on the network interface. What to do: The feature is enabled by default. If it has been manually disabled, re-enable it for the network interface.

  • nodeScatterGatherDisabled (Error). The scatter-gather feature is disabled on the network interface. What to do: The feature is enabled by default. If it has been manually disabled, re-enable it for the network interface.

  • nodeTCPSegmentationOffloadDisabled (Error). The tcp-segmentation-offload feature is disabled on the network interface. What to do: The feature is enabled by default. If it has been manually disabled, re-enable it for the network interface.

  • nodeGenericSegmentationOffloadDisabled (Error). The generic-segmentation-offload feature is disabled on the network interface. What to do: The feature is enabled by default. If it has been manually disabled, re-enable it for the network interface.

  • nodeVzLicenseInactive (Critical). The node's Virtuozzo license is inactive. What to do: Check and update the license.

  • guestPausedEIO (Critical). A virtual machine has been paused due to a disk I/O error. What to do: Make sure that the node's partition where the VM's disks are stored has not run out of space. If it has, free up more space for the VM. If there is enough free space, evacuate the virtual machine (and any other critical data) from the physical disk it is stored on, and replace that physical disk, as it is about to fail.

  • guestOsCrashed (Critical). A virtual machine has crashed after a BSOD or kernel panic in the guest OS. Collected for VMs only. What to do: Find out the reasons for the crash, fix the VM, and restart any services that will not do so automatically.

  • nodeSMARTDiskError (Critical). A S.M.A.R.T. counter for a node's disk is below the threshold. What to do: The node's disk is about to fail. Replace it as soon as possible.

  • nodeSMARTDiskWarning (Warning). A S.M.A.R.T. counter for a node's disk is greater than zero. What to do: Inspect the health of the node's disk. You may need to replace it soon.

  • highCPUusage (Warning). A virtual environment's CPU usage has been over 90% for the last 10 minutes. The alert is disabled by default. What to do: Check the virtual environment for potential problems, including software issues or malware. To enable the alert, uncomment it in vz-alerts.yml and restart Prometheus.

  • highMemUsage (Warning). A virtual environment's memory usage has been over 95% for the last 10 minutes. The alert is disabled by default. What to do: Check the virtual environment for potential problems, including software issues or malware. To enable the alert, uncomment it in vz-alerts.yml and restart Prometheus.

  • cpuUsageIncrease (Warning). A virtual environment's CPU usage has greatly increased compared to the previous week. The alert is disabled by default. What to do: Check the virtual environment for potential problems, including software issues or malware. To enable the alert, uncomment it in vz-alerts.yml and restart Prometheus.

  • memUsageIncrease (Warning). A virtual environment's memory usage has greatly increased compared to the previous week. The alert is disabled by default. What to do: Check the virtual environment for potential problems, including software issues or malware. To enable the alert, uncomment it in vz-alerts.yml and restart Prometheus.

  • nodeHighDiskWriteLatency (Warning). Write operation latency for a node's disk has been too high for the last 10 seconds. What to do: Find and fix the reason why the I/O requests have been taking so long. Possible reasons include high I/O load or deteriorating disk health.

  • nodeHighDiskReadLatency (Warning). Read operation latency for a node's disk has been too high for the last 10 seconds. What to do: Find and fix the reason why the I/O requests have been taking so long. Possible reasons include high I/O load or deteriorating disk health.

  • pendingKernelReboot (Warning). The node has been updated but not rebooted to the latest kernel. What to do: Reboot the node and switch to the latest kernel.

  • lowPageCache (Warning). The node has a high load average and a very small page cache. The node is overloaded, possibly due to memory overcommitment. What to do: Find out why the node is overloaded. If the node is overloaded due to overselling, migrate some of its virtual environments to other nodes.

  • highPfcacheUsage (Warning). The node's pfcache disk is 90% full and may run out of space. What to do: Add more space to the pfcache disk or clean it up. For details, see the Knowledge Base.

  • highVstorageMountLatency (Warning). The latency of vstorage-mount requests on a node has been too high for the last 10 seconds. What to do: Check the health of Virtuozzo Storage and its components. Fix the issues found.

  • slowIoRequest (Info). A virtual machine's I/O requests have been taking longer than 10 seconds. What to do: Inspect the storage, be it local disks or Virtuozzo Storage, for potential issues and fix them.

  • unresponsiveBalloonDriver (Info). A virtual machine's balloon driver is not responding. The VM has stopped reporting its memory usage statistics and is not releasing the node's memory automatically. What to do: Find out what has happened to the virtio_balloon kernel module inside the VM and reload the module.