Fencing compute nodes

A compute node might fail because of a kernel crash or power outage, or it might become unreachable over the network. A failed node is automatically fenced from scheduling new virtual machines on it. That is, the node is isolated from the compute cluster, and new VMs are created on other compute nodes. Once a failed node becomes available again, the system runs a health check and, if the check succeeds, automatically tries to return the node to operation. If a node cannot be unfenced within 30 minutes after a crash, or if it fails three times within an hour, you need to check its hardware, and then return it to operation manually.
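The manual-intervention rule above can be restated as a small shell check. This is only an illustrative sketch: the function name needs_manual_return and its two arguments are hypothetical, the thresholds are taken from the paragraph above, and treating the boundaries as inclusive is an assumption. It does not describe how the product evaluates the rule internally.

```shell
# Hypothetical helper: decide whether a node needs a hardware check and
# a manual return to operation.
#   $1  minutes the node has stayed fenced since the crash
#   $2  number of failures during the last hour
# Returns 0 (true) when manual action is needed, 1 otherwise.
# Assumption: both thresholds are inclusive (>= 30 minutes, >= 3 failures).
needs_manual_return() {
    minutes_since_crash=$1
    failures_last_hour=$2
    if [ "$minutes_since_crash" -ge 30 ] || [ "$failures_last_hour" -ge 3 ]; then
        return 0
    fi
    return 1
}
```

For example, needs_manual_return 45 0 succeeds because the node has stayed fenced for more than 30 minutes, while needs_manual_return 10 1 does not.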

If a VM evacuation is in progress when the node returns to operation, the process continues for the VMs that are already scheduled for evacuation. The remaining VMs stay on the node.

If you experience problems launching VMs on a particular node, you can fence this node manually for troubleshooting. Such nodes need to be returned to operation manually.

Limitations

  • The compute cluster can survive the failure of only one node.
  • Compute nodes that are in maintenance cannot be fenced.
  • Compute nodes with the single Controller role cannot be fenced.
  • If a compute node was fenced and then placed into maintenance mode, it will return to the "Fenced" state after exiting maintenance.
  • Fenced nodes cannot be released from the compute cluster.

Prerequisites

  • Before fencing a node, ensure it has no running virtual machines. You can either stop such VMs or migrate them live to other nodes.

To fence a compute node

Admin panel

  1. Go to the Compute > Nodes screen, and then click a node.
  2. On the node right pane, click Fence.
  3. In the Fence node window, optionally specify the fencing reason, and then click Fence.

Once the node is fenced, you can check the fencing reason in its details. If the node hosts stopped virtual machines, you can move them to healthy nodes by clicking Evacuate on the VM right pane.

Command-line interface

Use the following command:

vinfra service compute node fence [--force-down] [--reason <reason>] <node>
--force-down
Forcefully mark the node as down
--reason <reason>
The reason for disabling the compute node
<node>
Node ID or hostname

For example, to fence the node node003.vstoragedomain, run:

# vinfra service compute node fence node003.vstoragedomain
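To record why a node was fenced, the optional --reason flag from the synopsis above can be passed together with the node name. The following is a minimal sketch: the wrapper name fence_node and the reason text are hypothetical, and only the flag and argument listed in the synopsis are used.

```shell
# Hypothetical helper: fence a node and record the reason.
#   $1  node ID or hostname
#   $2  free-text fencing reason
# Uses only the --reason option and the <node> argument from the synopsis.
fence_node() {
    node=$1
    reason=$2
    vinfra service compute node fence --reason "$reason" "$node"
}
```

For example, fence_node node003.vstoragedomain "Troubleshooting VM launch failures" runs the fence command for node003.vstoragedomain with that reason attached.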

You can check that the node is successfully fenced in the vinfra service compute node list output:

# vinfra service compute node list
+--------------------------------------+------------------------+----------+--------------+
| id                                   | host                   | state    | roles        |
+--------------------------------------+------------------------+----------+--------------+
| 52565ca3-5893-8f6b-62ce-2f07b175b549 | node001.vstoragedomain | healthy  | - controller |
|                                      |                        |          | - compute    |
| 578ccd91-dd8d-50b0-e3a2-f6ccb5959159 | node002.vstoragedomain | healthy  | - compute    |
| 3ccf40b2-9437-b393-f02b-5282b188a4b3 | node003.vstoragedomain | disabled | - compute    |
+--------------------------------------+------------------------+----------+--------------+

In the command-line output, the fenced node has the disabled state.
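If you want to check the state from a script rather than by eye, the table can be filtered by hostname. This is a sketch that assumes the four-column layout shown above (id, host, state, roles); the function name node_state is hypothetical, and the parsing would need adjusting if the output format differs.

```shell
# Hypothetical helper: read the node list table on stdin and print the
# "state" column for the given host.
#   $1  hostname exactly as it appears in the "host" column
# Assumes "|" separates the columns: id, host, state, roles.
node_state() {
    awk -F'|' -v host="$1" '
        {
            h = $3; s = $4
            # Trim the padding around the host and state cells.
            gsub(/^[ \t]+|[ \t]+$/, "", h)
            gsub(/^[ \t]+|[ \t]+$/, "", s)
            if (h == host) { print s; exit }
        }'
}
```

A possible use is to pipe the command output into the helper, for example vinfra service compute node list | node_state node003.vstoragedomain, which prints disabled for the table shown above.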

To return a fenced node to operation

Admin panel

  1. Go to the Compute > Nodes screen, and then click a fenced node.
  2. On the node right pane, click Return to operation.
  3. In the confirmation window, click Return.

Command-line interface

Use the following command:

vinfra service compute node unfence <node>
<node>
Node ID or hostname

For example, to return the node node003.vstoragedomain to operation, run:

# vinfra service compute node unfence node003.vstoragedomain