Infrastructure alerts
Infrastructure alerts are generated and displayed in the admin panel. They can relate to the management node and admin panel, the license and updates, cluster connectivity, data-in-transit encryption, or a specific node and its disks and network interfaces.
Management node and admin panel alerts
-
High availability for the admin panel must be configured
-
Configure high availability for the admin panel in Settings > System settings > High availability configuration. Otherwise the admin panel will be a single point of failure.
-
Management node backup does not exist
-
Management node backup is older than <number> days.
-
Management node backup does not exist
-
The last management node backup has failed, does not exist, or is too old.
-
Changes to the management database are not replicated
-
Changes to the management database are not replicated to the node "<hostname>" because it is offline. Check the node's state and connectivity.
-
Changes to the management database are not replicated
-
Changes to the management database are not replicated to the node "<hostname>". Please contact the technical support.
-
Management node HA has four nodes
-
The management node HA configuration has four nodes. It is recommended to have three or five nodes included.
-
Management panel SSL certificate will expire in less than 30 days
-
Management panel SSL certificate will expire in less than 7 days
-
Management panel SSL certificate has expired
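The three certificate alerts above correspond to 30-day, 7-day, and expired thresholds. As a rough manual cross-check, the remaining lifetime of any PEM certificate can be classified with `openssl x509 -checkend`. This is only a sketch: the function name and the certificate path are assumptions for illustration, and it does not reflect how the admin panel itself evaluates the certificate.

```shell
#!/bin/sh
# Classify a PEM certificate by remaining lifetime, mirroring the
# 30-day / 7-day / expired levels of the alerts.
# Assumption: the caller supplies the path to the certificate file.
cert_expiry_level() {
    cert="$1"
    day=86400
    # Bail out early if the file is missing or is not a parsable certificate.
    if ! openssl x509 -noout -in "$cert" >/dev/null 2>&1; then
        echo "unreadable"
        return 1
    fi
    # -checkend N exits non-zero when the certificate expires within N seconds.
    if ! openssl x509 -checkend 0 -noout -in "$cert" >/dev/null 2>&1; then
        echo "expired"
    elif ! openssl x509 -checkend $((7 * day)) -noout -in "$cert" >/dev/null 2>&1; then
        echo "less-than-7-days"
    elif ! openssl x509 -checkend $((30 * day)) -noout -in "$cert" >/dev/null 2>&1; then
        echo "less-than-30-days"
    else
        echo "ok"
    fi
}
```

For example, `cert_expiry_level <path_to_certificate>` prints one of `ok`, `less-than-30-days`, `less-than-7-days`, `expired`, or `unreadable`.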
License alerts
-
License is not loaded
-
License is not loaded.
- Contact the sales representative to obtain a license key.
- Register the license key, as described in Managing licenses.
-
License is not updated
-
The license cannot be updated automatically and will expire in less than 21 days. Check the cluster connectivity to the license server or contact the technical support.
-
- Calculate the licensed storage capacity that you need. To do it, you can use the previous licensed storage size or check the consumed storage capacity on the Monitoring > Dashboard > Logical space chart.
- Contact the sales representative to extend your license with the storage capacity you determined in step 1.
- Upgrade your license key, as described in Managing licenses.
-
-
License will expire soon
-
The license has not been updated automatically and will expire in less than 7 days. Check the cluster connectivity to the license server and contact the technical support immediately.
-
- Calculate the licensed storage capacity that you need. To do it, you can use the previous licensed storage size or check the consumed storage capacity on the Monitoring > Dashboard > Logical space chart.
- Contact the sales representative to extend your license with the storage capacity you determined in step 1.
- Upgrade your license key, as described in Managing licenses.
-
-
License expired
-
The license of cluster "<cluster_name>" has expired. Contact your reseller to update your license immediately!
- Calculate the licensed storage capacity that you need. To do it, you can use the previous licensed storage size or check the consumed storage capacity on the Monitoring > Dashboard > Logical space chart.
- Contact the sales representative to extend your license with the storage capacity you determined in step 1.
- Upgrade your license key, as described in Managing licenses.
-
Licensed storage capacity is low
-
Cluster has reached 80% of licensed storage capacity.
- Check the licensed and consumed storage capacity on the Monitoring > Dashboard > Logical space chart.
- Contact the sales representative to add more storage capacity.
- Register a new license key, as described in Managing licenses.
-
Licensed storage capacity is critically low
-
Cluster has reached 90% of licensed storage capacity.
- Check the licensed and consumed storage capacity on the Monitoring > Dashboard > Logical space chart.
- Contact the sales representative to add more storage capacity.
- Register a new license key, as described in Managing licenses.
-
Cluster is out of licensed space
-
Cluster "<cluster_name>" has run out of storage space allowed by license. No more data can be written. Please contact your reseller to update your license immediately!
- Check the licensed and consumed storage capacity on the Monitoring > Dashboard > Logical space chart.
- Contact the sales representative to add more storage capacity.
- Register a new license key, as described in Managing licenses.
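The capacity alerts above fire at 80%, 90%, and 100% of the licensed storage capacity. The sketch below shows how the consumed and licensed values read from the Logical space chart map onto those levels. The function name is hypothetical, and the integer rounding is an assumption: the product computes these alerts itself and may round differently.

```shell
#!/bin/sh
# Classify licensed-capacity usage against the 80% / 90% / 100% alert levels.
# Arguments: consumed and licensed capacity as integers in the same unit
# (for example, GiB).
license_capacity_level() {
    used="$1"
    licensed="$2"
    if [ "$licensed" -le 0 ]; then
        echo "invalid"
        return 1
    fi
    # Integer percentage, rounded down (an assumption of this sketch).
    pct=$(( used * 100 / licensed ))
    if [ "$pct" -ge 100 ]; then
        echo "out-of-space"
    elif [ "$pct" -ge 90 ]; then
        echo "critically-low"
    elif [ "$pct" -ge 80 ]; then
        echo "low"
    else
        echo "ok"
    fi
}
```

For example, `license_capacity_level 850 1000` reports `low`, which suggests requesting more capacity before the 90% alert is raised.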
Update alerts
-
Software updates exist
-
Software updates exist for the node <hostname>. Current version: <current_version>. Available version: <available_version>.
Update Virtuozzo Hybrid Infrastructure to a new version, as described in Installing updates.
-
Software updates exist for the Management panel and compute API
-
Software updates exist for the Management panel and compute API. Current version: <current_version>. Available version: <available_version>.
Update Virtuozzo Hybrid Infrastructure to a new version, as described in Installing updates.
-
Update check failed
-
Update check failed on the node <hostname>.
The connection to the update repository could not be established.
Check access to the update repositories:
- Open the terminal on the node where the update check has failed.
-
Ensure that the hci-base and hci-updates repositories are enabled and the mirrorlist URL matches the current release. To do this, run:
# cat /etc/hci-release
# grep -P "^(mirrorlist|enabled)" /etc/yum.repos.d/hci.repo
-
Disable third-party repositories. Only two repositories must be enabled: hci-base and hci-updates. To check the enabled repositories, run:
# yum -q repolist enabled
-
Check access to the repositories:
# yum clean all; yum repoinfo hci-base; yum repoinfo hci-updates
-
Get the mirrorlist content by running:
# curl -L <mirrorlist_URL>
- Investigate the log file /var/log/vstorage-ui-agent/updater.log.
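The repository checks above can be partly scripted. The sketch below lists the repository IDs that are explicitly marked `enabled=1` in one or more yum repo files, so that anything other than hci-base and hci-updates stands out. The function name is hypothetical. It only inspects the files it is given and does not replace `yum -q repolist enabled`. In particular, it ignores repositories that omit the `enabled` key, which yum treats as enabled by default.

```shell
#!/bin/sh
# Print the IDs of repositories explicitly marked enabled=1 in the given
# yum repo files (INI-style: a [repo-id] header followed by key=value lines).
enabled_repos() {
    awk '
        /^\[/ {
            # Remember the current section name without the brackets.
            repo = $0
            sub(/^\[/, "", repo)
            sub(/\].*$/, "", repo)
            next
        }
        /^enabled[ \t]*=[ \t]*1[ \t]*$/ { print repo }
    ' "$@"
}
```

For example, `enabled_repos /etc/yum.repos.d/*.repo` is expected to print only `hci-base` and `hci-updates` on a correctly configured node.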
-
Multiple update checks failed
-
Update checks failed multiple times on the node <hostname>.
The connection to the update repository could not be established for at least 3 days.
Check access to the update repositories:
- Open the terminal on the node where the update check has failed.
-
Ensure that the hci-base and hci-updates repositories are enabled and the mirrorlist URL matches the current release. To do this, run:
# cat /etc/hci-release
# grep -P "^(mirrorlist|enabled)" /etc/yum.repos.d/hci.repo
-
Disable third-party repositories. Only two repositories must be enabled: hci-base and hci-updates. To check the enabled repositories, run:
# yum -q repolist enabled
-
Check access to the repositories:
# yum clean all; yum repoinfo hci-base; yum repoinfo hci-updates
-
Get the mirrorlist content by running:
# curl -L <mirrorlist_URL>
- Investigate the log file /var/log/vstorage-ui-agent/updater.log.
-
Update download failed
-
Update download failed on the node <hostname>.
The reason can be one of the following:
- The update check has failed.
- There is not enough free space on the root file system.
- (Rare) A new version became available during the download, and the version that is currently being downloaded is not available anymore.
To solve the issue:
-
Check access to the update repositories:
- Open the terminal on the node where the update check has failed.
-
Ensure that the hci-base and hci-updates repositories are enabled and the mirrorlist URL matches the current release. To do this, run:
# cat /etc/hci-release
# grep -P "^(mirrorlist|enabled)" /etc/yum.repos.d/hci.repo
-
Disable third-party repositories. Only two repositories must be enabled: hci-base and hci-updates. To check the enabled repositories, run:
# yum -q repolist enabled
-
Check access to the repositories:
# yum clean all; yum repoinfo hci-base; yum repoinfo hci-updates
-
Get the mirrorlist content by running:
# curl -L <mirrorlist_URL>
- Investigate the log file /var/log/vstorage-ui-agent/updater.log.
- Ensure that the root file system has more than 1 GB of free space left.
- Retry the update download.
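The free-space requirement above (more than 1 GB on the root file system) can be checked from the node's terminal. The sketch below wraps `df -Pk` in a small helper. The function name is hypothetical, and treating 1 GB as 1048576 KB is an assumption of this example.

```shell
#!/bin/sh
# Succeed when the file system that holds PATH has more than MIN_KB
# kilobytes available.
# Arguments: PATH MIN_KB
has_free_space() {
    path="$1"
    min_kb="$2"
    # -P forces the POSIX one-line-per-file-system layout; column 4 is "Available".
    avail_kb=$(df -Pk "$path" | awk 'NR == 2 { print $4 }')
    [ -n "$avail_kb" ] && [ "$avail_kb" -gt "$min_kb" ]
}
```

For example, `has_free_space / 1048576 || echo "Free up space on the root file system"` prints the message only when the requirement is not met.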
-
Node update failed
-
Software update failed on the node <hostname>.
The reason can be one of the following:
- The node rebooted unexpectedly.
- There are third-party packages, which conflict with the official packages.
Try updating the node again. If the issue persists, contact the technical support team.
-
Update failed
-
Update failed for the management panel and compute API.
The reason can be one of the following:
- The management node rebooted unexpectedly.
- Failed to update the compute service.
Try updating the component again. If the issue persists, contact the technical support team.
-
Cluster update failed
-
Update failed for the cluster.
Try updating the cluster again. If the issue persists, contact the technical support team.
-
Entering maintenance for update failed
-
Entering maintenance failed while updating the node <hostname>.
Contact the technical support team.
Cluster connectivity alerts
-
Network connectivity failed
-
Node "<hostname>" has network connectivity problems: unstable connectivity via network "<network_name>" due to the loss of all MTU-sized packets.
-
Cluster network connectivity problem
-
All nodes have network connectivity problems: unstable connectivity via network "<network_name>" due to packet loss.
-
Cluster network connectivity problem
-
All nodes have network connectivity problems: no connectivity via network "<network_name>".
-
Node network connectivity problem
-
Node "<hostname>" has network connectivity problems: no connectivity to node "<hostname>" with interface "<iface>" via interface "<iface>".
-
Node network unstable connectivity
-
Node "<hostname>" has network connectivity problems: unstable connectivity to node "<hostname>" with interface "<iface>" via interface "<iface>" due to the loss of all MTU-sized packets.
-
Node network packet loss
-
Node "<hostname>" has network connectivity problems: unstable connectivity to node "<hostname>" with interface "<iface>" via interface "<iface>" due to packet loss.
-
Node network some MTU packet loss
-
Node "<hostname>" has network connectivity problems: unstable connectivity to node "<hostname>" with interface "<iface>" via interface "<iface>" due to the loss of some MTU-sized packets.
-
MTU mismatch
-
Network <network_name> has assigned interfaces with different MTU.
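To investigate an MTU mismatch manually, compare the MTU values of the interfaces assigned to the affected network on each node. The sketch below reads them from sysfs. The function names are hypothetical, and the list of interfaces to compare has to come from your own network assignment. A related manual test for the loss of MTU-sized packets is a non-fragmenting ping such as `ping -M do -s 1472 -c 5 <node_IP>` for a 1500-byte IPv4 MTU (1472 bytes of payload plus 28 bytes of headers).

```shell
#!/bin/sh
# Print "<interface> <mtu>" for every interface under a sysfs-style tree.
# Argument: tree root (defaults to /sys/class/net).
list_mtus() {
    root="${1:-/sys/class/net}"
    for dir in "$root"/*; do
        [ -r "$dir/mtu" ] || continue
        printf '%s %s\n' "$(basename "$dir")" "$(cat "$dir/mtu")"
    done
}

# Succeed only when all named interfaces share a single MTU value.
# Arguments: tree root, then one or more interface names.
mtus_match() {
    root="$1"
    shift
    count=$(for iface in "$@"; do cat "$root/$iface/mtu"; done | sort -u | wc -l | tr -d ' ')
    [ "$count" -eq 1 ]
}
```

For example, `mtus_match /sys/class/net eth0 eth1 || echo "MTU mismatch"` flags two interfaces (the names here are placeholders) that are configured with different MTU values.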
Node alerts
-
Node is offline
-
Node “<hostname>” is offline.
-
Node got offline too many times
-
Node “<hostname>” got offline too many times for the last hour.
-
Kernel is outdated
-
Node “<hostname>” is not running the latest kernel.
-
OOM killer triggered
-
OOM killer has been triggered on node “<hostname>”.
-
Time is not synced
-
Time on node “<hostname>” differs from time on backend node by more than <value_5<30> seconds.
-
Node time critically unsynced
-
Time on node <hostname> is critically unsynced, differing from the time on backend node by more than <value_>30> seconds.
-
No Internet access
-
Cluster node <hostname> cannot reach the repository. Make sure that all cluster nodes have Internet access.
-
Incompatible hardware detected
-
Incompatible hardware detected on node "<hostname>": <hardware_list>. Using Mellanox and AMD may lead to data loss. Please double check that SR-IOV is properly enabled. Visit https://support.virtuozzo.com/hc/en-us/articles/19764365143953 to learn how to troubleshoot this issue.
-
Node has high CPU usage
-
Node <hostname> has CPU usage higher than 90%. The current value is <value>%.
-
Node has high memory usage
-
Node <hostname> has memory usage higher than 95%. The current value is <value>%.
-
Node has high disk I/O usage
-
Disk /dev/<disk_name> on node <hostname> has I/O usage higher than 85%. The current value is <value>%.
-
Node has high swap usage
-
Node <hostname> has swap usage higher than 40%. The current value is <value>%.
-
Node has critically high swap usage
-
Node <hostname> has critically high swap usage, exceeding 80%. The current value is <value>%.
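The two swap alerts above use 40% and 80% thresholds. For a quick spot check on a node, swap usage can be derived from the `SwapTotal` and `SwapFree` fields of `/proc/meminfo`. The sketch below is only a manual approximation with hypothetical function names. The alerts themselves come from the product's monitoring metrics, which may be sampled and averaged differently.

```shell
#!/bin/sh
# Print swap usage as an integer percentage from a meminfo-style file.
# Argument: file path (defaults to /proc/meminfo).
swap_usage_pct() {
    awk '
        /^SwapTotal:/ { swap_total = $2 }
        /^SwapFree:/  { swap_free = $2 }
        END {
            if (swap_total > 0)
                printf "%d\n", (swap_total - swap_free) * 100 / swap_total
            else
                print 0
        }
    ' "${1:-/proc/meminfo}"
}

# Map a percentage onto the alert levels: above 80 is critical, above 40 is high.
swap_level() {
    if [ "$1" -gt 80 ]; then
        echo "critical"
    elif [ "$1" -gt 40 ]; then
        echo "high"
    else
        echo "ok"
    fi
}
```

For example, `swap_level "$(swap_usage_pct)"` prints `ok`, `high`, or `critical` for the node where it is run.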
-
Node has high receive packet loss rate
-
Node <hostname> has <value> receive packet loss rate reported by job <job_name>.
-
Node has high transmit packet loss rate
-
Node <hostname> has <value> transmit packet loss rate reported by job <job_name>.
-
Node has high receive packet error rate
-
Node <hostname> has <value> receive packet error rate reported by job <job_name>.
-
Node has high transmit packet error rate
-
Node <hostname> has <value> transmit packet error rate reported by job <job_name>.
-
Reached "node crash per hour" threshold
-
Node <hostname> with shaman node ID <id> has reached the "node crash per hour" threshold.
-
OOM-kill event detected
-
OOM-kill event detected on node <hostname> at least once for the last 24 hours. You need to check memory consumption.
-
Node failed to return to operation
-
Node <hostname> has failed to automatically return to operation within 30 minutes after a crash. Check the node's hardware, and then try returning it to operation manually.
-
Node crash detected
-
Node <hostname> crashed, which started the VM evacuation.
Disk alerts
-
S.M.A.R.T. warning
-
Disk “<disk_name>” (<serial>) on node “<hostname>” has failed a S.M.A.R.T. check.
-
Disk error
-
Disk “<disk_name>” (<serial>) failed on node “<hostname>”.
-
Disk is running out of space
-
Root partition on node “<hostname>” is running out of space (less than 10% or 5 GB of free space).
-
Disk is out of space
-
Root partition on node “<hostname>” is running out of space (for non-compute nodes: less than 5% of free space).
-
Compute node disk is out of space
-
Root partition on node “<hostname>” is running out of space (for compute nodes: less than 1 GB of free space).
-
Software RAID is not fully synced
-
Software RAID <disk_name> on node <hostname> is <value>% synced.
-
Systemd service is flapping
-
Systemd service <service_name> on node <hostname> has changed its state more than 5 times in 5 minutes or 15 times in one hour.
-
SMART Media Wearout warning
-
Disk <disk_name> on node <hostname> is almost worn out and may fail soon. Consider replacement.
The alert is based on the smart_media_wearout_indicator metric. For details on the Media Wearout Indicator S.M.A.R.T. attribute, refer to Disk health analyzers.
-
SMART Media Wearout critical
-
Disk <disk_name> on node <hostname> is worn out and will fail soon. Consider replacement.
The alert is based on the smart_media_wearout_indicator metric. For details on the Media Wearout Indicator S.M.A.R.T. attribute, refer to Disk health analyzers.
Network interface alerts
-
Network warning
-
Network interface “<iface_name>” has incorrect settings: <duplex> duplex and <speed> speed.
-
Network warning
-
Network interface “<iface_name>” on node “<hostname>” is missing important features (or has them disabled): “<feature_name>”.
-
Network warning
-
Network interface “<iface_name>” on node “<hostname>” is not in the full duplex mode.
-
Network warning
-
Network interface “<iface_name>” on node “<hostname>” has speed lower than the minimally required 1 Gbps.
-
Network warning
-
Network interface “<iface_name>” on node “<hostname>” has an undefined speed.
-
Network interface is flapping
-
Network interface <iface_name> on node <hostname> is flapping.
-
Network bond is not redundant
-
Network bond <iface_name> on node <hostname> is missing <number> subordinate interface(s).
Data-in-transit encryption alerts
-
Enabling IPv6 mode takes too much time
-
Operation to enable the IPv6 mode is running for more than 1 hour. Please contact the technical support.
- Cancel the operation to enable the IPv6 mode by using the vinfra cluster network encryption cancel command.
- Retry the operation.
- If you cannot troubleshoot the problem, contact the technical support team.
-
Enabling traffic encryption takes too much time
-
Operation to enable traffic encryption is running for more than 1 hour. Please contact the technical support.
- Cancel the operation to enable traffic encryption by using the vinfra cluster network encryption cancel command.
- Retry the operation.
- If you cannot troubleshoot the problem, contact the technical support team.
-
System configuration is not optimal for traffic encryption
-
Traffic encryption is enabled but the storage network is not in the IPv6 mode. Switch on the IPv6 configuration, as described in the product documentation.
- Disable traffic encryption for the storage network, and then enable it again, as described in Enabling and disabling data-in-transit encryption.
- If you cannot troubleshoot the problem, contact the technical support team.
-
Node IPsec certificate will expire in less than 7 days
-
IPsec certificate for node <hostname> with ID <id> will expire in less than 7 days. Renew the certificate, as described in the product documentation, or contact the technical support.
- Follow the instructions in Renewing encryption certificates to renew the certificate.
- If you cannot troubleshoot the problem, contact the technical support team.
-
Node IPsec certificate has expired
-
IPsec certificate for node <hostname> with ID <id> has expired. Renew the certificate, as described in the product documentation, or contact the technical support.
- Follow the instructions in Renewing encryption certificates to renew the certificate.
- If you cannot troubleshoot the problem, contact the technical support team.