Infrastructure alerts

Infrastructure alerts are generated and displayed in the admin panel. They can relate to the management node and admin panel, the license, updates, cluster connectivity, individual nodes with their disks and network interfaces, and data-in-transit encryption.

Management node and admin panel alerts

High availability for the admin panel must be configured

Configure high availability for the admin panel in Settings > System settings > High availability configuration. Otherwise, the admin panel will be a single point of failure.

Management node backup does not exist

Management node backup is older than <number_of_days> days.

Management node backup does not exist

The last management node backup has failed, does not exist, or is too old.

Changes to the management database are not replicated

Changes to the management database are not replicated to the node "<hostname>" because it is offline. Check the node's state and connectivity.

Changes to the management database are not replicated

Changes to the management database are not replicated to the node "<hostname>". Please contact technical support.

Management node HA has four nodes

The management node HA configuration has four nodes. It is recommended to have three or five nodes included.
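
The recommendation for an odd node count is consistent with majority-based quorum, although this section does not say which mechanism the management node HA actually uses. The sketch below assumes a simple majority quorum (an assumption, not a documented detail) and shows that, under that assumption, a fourth node adds no failure tolerance over three. The function name `tolerated_failures` is hypothetical.

```python
def tolerated_failures(node_count: int) -> int:
    """Number of nodes that can fail while a strict majority remains.

    Assumes simple majority quorum; the product's actual HA
    mechanism is not described in this section.
    """
    if node_count < 1:
        raise ValueError("node_count must be positive")
    quorum = node_count // 2 + 1  # smallest strict majority
    return node_count - quorum


for n in (3, 4, 5):
    print(n, "nodes tolerate", tolerated_failures(n), "failure(s)")
```

Under this assumption, three and four nodes both tolerate one failure, while five nodes tolerate two.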

License alerts

License is not loaded

License expired

Licensed storage capacity is low

Licensed storage capacity is critically low

Cluster is out of space

Update alerts

Software updates exist

Update check failed

Multiple update checks failed

Update download failed

Node update failed

Update failed

Cluster update failed

Entering maintenance for update failed

Cluster connectivity alerts

Cluster network connectivity problem

All nodes have network connectivity problems: unstable connectivity via network "<network_name>" due to packet loss.

Cluster network connectivity problem

All nodes have network connectivity problems: no connectivity via network "<network_name>".

Node network connectivity problem

Node "<hostname>" has network connectivity problems: unstable connectivity via network "<network_name>" due to the loss of all MTU-sized packets.

Node network connectivity problem

Node "<hostname>" has network connectivity problems: unstable connectivity via network "<network_name>" due to the loss of some MTU-sized packets.

Node network connectivity problem

Node "<hostname>" has network connectivity problems: unstable connectivity via network "<network_name>" due to packet loss.

Node network connectivity problem

Node "<hostname>" has network connectivity problems: no connectivity to node "<hostname>" with interface "<iface>" via interface "<iface>".

Node network connectivity problem

Node "<hostname>" has network connectivity problems: unstable connectivity to node "<hostname>" with interface "<iface>" via interface "<iface>" due to the loss of all MTU-sized packets.

Node network connectivity problem

Node "<hostname>" has network connectivity problems: unstable connectivity to node "<hostname>" with interface "<iface>" via interface "<iface>" due to packet loss.

Node network connectivity problem

Node "<hostname>" has network connectivity problems: unstable connectivity to node "<hostname>" with interface "<iface>" via interface "<iface>" due to the loss of some MTU-sized packets.

MTU mismatch

Some interfaces have MTU that differs from other interfaces in the same network: network "<network_name>" interface@host "<iface> @ <hostname>".
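
As an illustration of what the MTU mismatch alert reports, the following minimal sketch groups interfaces by network and lists those whose MTU differs from the rest. The input format, the function name `find_mtu_mismatches`, and the choice of the most common MTU as the reference value are all assumptions made for this example; the product's own detection logic is not described here.

```python
from collections import Counter, defaultdict


def find_mtu_mismatches(interfaces):
    """Return {network: ["iface @ host", ...]} for interfaces whose MTU
    differs from the most common MTU in their network.

    `interfaces` is an iterable of (network, hostname, iface, mtu) tuples.
    Using the most common MTU as the reference is an assumption.
    """
    by_network = defaultdict(list)
    for network, hostname, iface, mtu in interfaces:
        by_network[network].append((hostname, iface, mtu))

    mismatches = {}
    for network, members in by_network.items():
        reference_mtu, _ = Counter(m for _, _, m in members).most_common(1)[0]
        odd = [f"{iface} @ {host}" for host, iface, m in members
               if m != reference_mtu]
        if odd:
            mismatches[network] = odd
    return mismatches


sample = [
    ("storage", "node1", "eth0", 9000),
    ("storage", "node2", "eth0", 9000),
    ("storage", "node3", "eth0", 1500),
    ("public", "node1", "eth1", 1500),
]
print(find_mtu_mismatches(sample))
```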

Node alerts

Node is offline

Node “<hostname>” is offline.

Node got offline too many times

Node “<hostname>” got offline too many times last hour.

Kernel is outdated

Node “<hostname>” is not running the latest kernel.

OOM killer triggered

OOM killer has been triggered on node “<hostname>”.

Time is not synced

Time on node “<hostname>” differs from time on backend node by more than 5 seconds.

No Internet access

Cluster node <hostname> cannot reach the repository. Make sure that all cluster nodes have Internet access.

Incompatible hardware detected

Incompatible hardware detected on node "<hostname>": <hardware_list>. Using Mellanox and AMD may lead to data loss. Please double check that SR-IOV is properly enabled.

Swap space is running low

<swap_proportion>% of swap is used on node "<hostname>".

Node has high CPU usage

Node <hostname> has CPU usage higher than 90%. The current value is <value>%.

Node has high memory usage

Node <hostname> has memory usage higher than 95%. The current value is <value>%.

Node has high disk I/O usage

Disk /dev/<disk_name> on node <hostname> has I/O usage higher than 85%. The current value is <value>%.
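
The CPU, memory, and disk I/O alerts above are plain threshold checks. The sketch below restates them in code, using the percentages from the alert texts. The metric names (`cpu`, `memory`, `disk_io`) and the function `exceeded` are hypothetical, and the comparison is assumed to be strict ("higher than").

```python
# Thresholds in percent, taken from the alert texts above.
# The metric names are hypothetical labels for this example.
THRESHOLDS = {"cpu": 90.0, "memory": 95.0, "disk_io": 85.0}


def exceeded(metrics):
    """Return sorted names of metrics strictly above their threshold."""
    return sorted(
        name for name, value in metrics.items()
        if name in THRESHOLDS and value > THRESHOLDS[name]
    )


print(exceeded({"cpu": 93.5, "memory": 95.0, "disk_io": 85.1}))
```

In this example, memory at exactly 95% does not count as exceeded, because the alert text says "higher than".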

Node has high receive packet loss rate

Node <hostname> has <value> receive packet loss rate reported by job <job_name>.

Node has high transmit packet loss rate

Node <hostname> has <value> transmit packet loss rate reported by job <job_name>.

Node has high receive packet error rate

Node <hostname> has <value> receive packet error rate reported by job <job_name>.

Node has high transmit packet error rate

Node <hostname> has <value> transmit packet error rate reported by job <job_name>.

Reached "node crash per hour" threshold

Node <hostname> with shaman node ID <id> has reached the "node crash per hour" threshold.

OOM-kill event detected

OOM-kill event detected on node <hostname> at least once in the last 24 hours. Check memory consumption.

Node failed to return to operation

Node <hostname> has failed to automatically return to operation within 30 minutes after a crash. Check the node's hardware, and then try returning it to operation manually.

Disk alerts

S.M.A.R.T. warning

Disk “<disk_name>” (<serial>) on node “<hostname>” has failed a S.M.A.R.T. check.

Disk error

Disk “<disk_name>” (<serial>) failed on node “<hostname>”.

Disk is out of space

Root partition on node “<hostname>” is running out of space (less than 10% of free space).
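
To see how close a node is to the 10% free-space condition named in this alert, a check along the following lines can be run locally. This is a minimal sketch, not the product's own check: the function name `is_low_on_space` is hypothetical, and only the 10% figure comes from the alert text.

```python
import shutil

FREE_THRESHOLD = 0.10  # "less than 10% of free space" from the alert text


def is_low_on_space(total_bytes, free_bytes, threshold=FREE_THRESHOLD):
    """True when the free fraction of a partition is below the threshold."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return free_bytes / total_bytes < threshold


usage = shutil.disk_usage("/")  # root partition of the local machine
print("root low on space:", is_low_on_space(usage.total, usage.free))
```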

Software RAID is not fully synced

Software RAID <disk_name> on node <hostname> is <value>% synced.

Systemd service is flapping

Systemd service <service_name> on node <hostname> has changed its state more than 5 times in 5 minutes or 15 times in one hour.
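
The flapping condition combines two sliding windows. The sketch below shows one way to evaluate such a rule over a list of state-change timestamps. It reads the second limit as "more than 15 times in one hour", which is an interpretation of the alert text, and the names `RULES` and `is_flapping` are hypothetical.

```python
from bisect import bisect_right

# (window in seconds, maximum allowed state changes) from the alert text.
# Reading "15 times in one hour" as "more than 15" is an assumption.
RULES = ((5 * 60, 5), (60 * 60, 15))


def is_flapping(change_times, now):
    """True if the number of state changes in any window ending at `now`
    exceeds that window's limit.

    `change_times` is a sorted list of timestamps in seconds.
    """
    end = bisect_right(change_times, now)
    for window, limit in RULES:
        # Index of the first change strictly after the window start.
        start = bisect_right(change_times, now - window)
        if end - start > limit:
            return True
    return False


# Six changes within five minutes exceed the first limit.
print(is_flapping([0, 60, 120, 180, 240, 290], now=295))
```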

Network interface alerts

Network warning

Network interface “<iface_name>” has incorrect settings: <duplex> duplex and <speed> speed.

Network warning

Network interface “<iface_name>” on node “<hostname>” is missing important features (or has them disabled): “<feature_name>”.

Network warning

Network interface “<iface_name>” on node “<hostname>” is not in the full duplex mode.

Network warning

Network interface “<iface_name>” on node “<hostname>” has speed lower than the minimally required 1 Gbps.

Network warning

Network interface “<iface_name>” on node “<hostname>” has an undefined speed.

Network interface is flapping

Network interface <iface_name> on node <hostname> is flapping.

Network bond is not redundant

Network bond <iface_name> on node <hostname> is missing <number_of_ifaces> subordinate interface(s).

Data-in-transit encryption alerts

Enabling IPv6 mode takes too much time

Enabling traffic encryption takes too much time

System configuration is not optimal for traffic encryption

Node IPsec certificate will expire in less than 7 days

Node IPsec certificate has expired
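
The two IPsec certificate alerts differ only in how the certificate's expiry time compares with the current time. The following minimal sketch classifies a certificate accordingly. The 7-day window comes from the alert title; the function name `certificate_status` and the returned labels are hypothetical, and how the product reads the expiry date is not covered here.

```python
from datetime import datetime, timedelta, timezone

WARNING_WINDOW = timedelta(days=7)  # from the alert title


def certificate_status(not_after, now):
    """Classify a certificate by its expiry time.

    Both arguments are timezone-aware datetimes.
    """
    if now >= not_after:
        return "expired"
    if not_after - now < WARNING_WINDOW:
        return "expiring"
    return "ok"


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(certificate_status(now + timedelta(days=3), now))
```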