Сluster rebuilding

The storage cluster is self-healing. If a node or disk fails, a cluster will automatically try to restore the lost data, that is, rebuild itself.

Rebuilding prerequisites

For a cluster to be able to rebuild itself, it must have at least:

Rebuilding process

The rebuilding process consists of several steps. Every CS sends a heartbeat message to an MDS every 5 seconds. If a heartbeat is not sent, the CS is considered inactive and the MDS informs all cluster components that they stop requesting operations on its data. If no heartbeats are received from a CS for 15 minutes, the MDS considers that CS offline and starts cluster rebuilding. In the process, the MDS finds CSes that do not have replicas of the lost data and restores the data—one replica at a time—as follows:

  • If replication is used, the existing replicas of a degraded chunk are locked (to make sure all replicas remain identical) and one is copied to the new CS. If at this time a client needs to read some data that has not been rebuilt yet, it reads any remaining replica of that data.
  • If erasure coding is used, the new CS requests almost all of the remaining data pieces to rebuild the missing ones. If at this time a client needs to read some data that has not yet been rebuilt, that data is rebuilt out of turn and then read.

Self-healing requires more network traffic and CPU resources if replication is used. On the other hand, rebuilding with erasure coding is slower.

If a node or disk goes offline during maintenance, cluster self-healing is delayed, to save cluster resources. The default delay is 30 minutes. You can adjust it by setting the mds.wd.offline_tout_mnt parameter, in milliseconds, with the vstorage -c <cluster_name> set-config command.

Recommendations for cluster rebuilding

Two recommendations that help smooth out rebuilding overhead:

  • To simplify rebuilding, keep uniform disk counts and capacity sizes on all nodes.
  • Rebuilding places additional load on the network, and increases the latency of read and write operations. The more network bandwidth the cluster has, the faster rebuilding will be completed and bandwidth freed up.