The storage cluster is self-healing. If a node or disk fails, a cluster will automatically try to restore the lost data, that is, rebuild itself.
For a cluster to be able to rebuild itself, it must have at least:
As many healthy nodes as required by the redundancy mode
Consider the following example. In a cluster that works in the 5+2 erasure coding mode and has seven nodes (that is, the minimum), each replica of user data is distributed to each of the 5+2 nodes for redundancy. If one or two nodes fail, the user data will not be lost, but the cluster will become degraded and will not be able to rebuild itself until at least seven nodes are healthy again (until you add the missing nodes). For comparison, in a cluster that works in the 5+2 erasure coding mode and has ten nodes, each replica of user data is distributed to the random 5+2 nodes out of ten, to even out the load on CSes. If up to three nodes fail, such a cluster will still have enough nodes to rebuild itself.
Enough free space to accommodate as much data as any one node can store
Consider the following example. In a cluster that has ten 10 TB nodes, at least 1 TB on each node should be kept free, so if a node fails, its 9 TB of data can be rebuilt on the remaining nine nodes. If, however, a cluster has ten 10 TB nodes and one 20 TB node, each smaller node should have at least 2 TB free in case the largest node fails (while the largest node should have 1 TB free).
The rebuilding process consists of several steps. Every CS sends a heartbeat message to an MDS every 5 seconds. If a heartbeat is not sent, the CS is considered inactive and the MDS informs all cluster components that they stop requesting operations on its data. If no heartbeats are received from a CS for 15 minutes, the MDS considers that CS offline and starts cluster rebuilding. In the process, the MDS finds CSes that do not have replicas of the lost data and restores the data—one replica at a time—as follows:
- If replication is used, the existing replicas of a degraded chunk are locked (to make sure all replicas remain identical) and one is copied to the new CS. If at this time a client needs to read some data that has not been rebuilt yet, it reads any remaining replica of that data.
- If erasure coding is used, the new CS requests almost all of the remaining data pieces to rebuild the missing ones. If at this time a client needs to read some data that has not yet been rebuilt, that data is rebuilt out of turn and then read.
Self-healing requires more network traffic and CPU resources if replication is used. On the other hand, rebuilding with erasure coding is slower.
If a node or disk goes offline during maintenance, cluster self-healing is delayed, to save cluster resources. The default delay is 30 minutes. You can adjust it by setting the
mds.wd.offline_tout_mnt parameter, in milliseconds, with the
vstorage -c <cluster_name> set-config command.
Two recommendations that help smooth out rebuilding overhead:
- To simplify rebuilding, keep uniform disk counts and capacity sizes on all nodes.
- Rebuilding places additional load on the network, and increases the latency of read and write operations. The more network bandwidth the cluster has, the faster rebuilding will be completed and bandwidth freed up.