Failure domains
The idea behind failure domains is to define a scope (for example, a rack) that can fail without making its data unavailable. If you choose the rack failure domain, the cluster tolerates the failure of one rack: the remaining racks keep the data available. Likewise, if you choose the host failure domain, the loss of an entire server does not make the data unavailable.
To provide high availability, Virtuozzo Hybrid Infrastructure spreads data replicas evenly across failure domains, according to a replica placement policy. The following policies are available:
- Disk, the smallest possible failure domain. Under this policy, Virtuozzo Hybrid Infrastructure never places more than one data replica per disk. While this protects against disk failure, data may still be lost if the replicas happen to be on different disks of the same host and that host fails. This policy should be used with one-node clusters.
- Host as a failure domain. Under this policy, Virtuozzo Hybrid Infrastructure never places more than one data replica per host. So, if a storage node fails (for example, because of an operating system crash) and all its disks become unavailable, the data is still accessible from the healthy nodes.
- Rack as a failure domain. Under this policy, Virtuozzo Hybrid Infrastructure never places more than one data replica per rack. So, if a single rack fails (for example, because its top-of-rack switch fails) and all the nodes in it become unavailable, the data is still accessible from the other racks.
- Row as a failure domain. Under this policy, Virtuozzo Hybrid Infrastructure never places more than one data replica per row. So, if a single row fails (for example, because its power source fails) and all the racks in it become unavailable, the data is still accessible from the other rows.
- Room as a failure domain. Under this policy, Virtuozzo Hybrid Infrastructure never places more than one data replica per room. So, if a single room fails (for example, because of a power outage) and all the rows in it become unavailable, the data is still accessible from the other rooms.
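The rule shared by all these policies, never more than one replica per failure domain, can be illustrated with a minimal Python sketch. It is not the product's placement algorithm: the `Disk` class, the location labels, and the `place_replicas` helper are hypothetical names introduced here for illustration, and the sketch simply picks the first disk it finds in each distinct domain.

```python
from dataclasses import dataclass

# Location levels, ordered from the smallest failure domain to the largest.
LEVELS = ("disk", "host", "rack", "row", "room")


@dataclass(frozen=True)
class Disk:
    """One storage disk plus the (hypothetical) labels of where it lives."""
    disk: str
    host: str
    rack: str
    row: str
    room: str


def domain_key(d: Disk, level: str) -> tuple:
    """Identify the failure domain of a disk as a path from room down to `level`.

    Using the full path keeps, say, rack "r1" in room A distinct from
    rack "r1" in room B.
    """
    idx = LEVELS.index(level)
    return tuple(getattr(d, name) for name in reversed(LEVELS[idx:]))


def place_replicas(disks: list[Disk], replicas: int, level: str) -> list[Disk]:
    """Pick `replicas` disks so that no two share a failure domain at `level`."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    chosen: dict[tuple, Disk] = {}
    for d in disks:
        # Keep only the first disk seen in each failure domain.
        chosen.setdefault(domain_key(d, level), d)
        if len(chosen) == replicas:
            return list(chosen.values())
    raise ValueError(
        f"need {replicas} distinct {level} domains, found {len(chosen)}"
    )


disks = [
    Disk("d1", "h1", "r1", "row1", "roomA"),
    Disk("d2", "h1", "r1", "row1", "roomA"),
    Disk("d3", "h2", "r1", "row1", "roomA"),
    Disk("d4", "h3", "r2", "row1", "roomA"),
]

# Disk policy: both replicas may land on host h1, so losing h1 loses both.
print([d.disk for d in place_replicas(disks, 2, "disk")])  # ['d1', 'd2']
# Host policy: the replicas are forced onto different hosts.
print([d.host for d in place_replicas(disks, 2, "host")])  # ['h1', 'h2']
# Rack policy: the replicas are forced onto different racks.
print([d.rack for d in place_replicas(disks, 2, "rack")])  # ['r1', 'r2']
```

With this toy inventory, asking for three replicas under the rack policy raises an error, because only two racks exist.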
When selecting a failure domain, consider the following recommendations:
- Make sure the metadata services are distributed among the locations. For example, if you choose a room as the failure domain, and distribute the data across several rooms evenly, you must distribute metadata services too. If you put all the metadata services in one room and it fails due to a power outage, the cluster will not function properly.
- To select a location as the failure domain, you need to have several locations of that kind, so that a service or the data can move from one failure domain to another, such as from one rack to another. For example, if you want to choose the rack failure domain with 2 replicas or 1+1 encoding, make sure you have at least two racks with healthy nodes assigned to the cluster.
- The disk space should be distributed evenly between the failure domains. For example, if you select the rack failure domain, equal disk space should be available on each of the racks. When every rack has to store one replica of each data chunk, the allocatable disk space in each rack is limited to the disk space on the smallest rack. So once the disk space on the smallest rack runs out, no more chunks can be created in the cluster until a new rack is added or the replication factor is decreased. Large failure domains are more sensitive to total disk space imbalance. For example, if a domain has 5 racks with 10 TB, 20 TB, 30 TB, 100 TB, and 100 TB of total disk space, it will not be possible to allocate (10+20+30+100+100)/3 ≈ 86.7 TB of data in 3 replicas. Instead, only 60 TB will be allocatable, as the low-capacity racks will be exhausted sooner: the two 100 TB racks can each hold at most one replica of a chunk, so the third replica of every chunk must go to the small racks, which hold 60 TB in total. Meanwhile, the largest racks (the 100 TB ones) will still have free space that cannot be allocated.
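The last two recommendations can be checked with a short calculation. The Python sketch below is an idealized model, not a tool shipped with the product: the function names are hypothetical, chunks are treated as infinitely divisible, and the assumption that an M+N encoding needs M+N failure domains is generalized from the 1+1 example above.

```python
def required_domains(scheme: str) -> int:
    """Number of failure domains a redundancy scheme needs.

    "2" means 2 replicas; "1+1" means M+N encoding. Treating M+N as
    needing M+N domains is an assumption generalized from the 1+1 case.
    """
    return sum(int(part) for part in scheme.split("+"))


def allocatable_data(capacities_tb: list[float], replicas: int) -> float:
    """Upper bound on user data (in TB) storable with one replica per domain.

    The j largest domains can each hold at most one replica of every chunk,
    so the remaining (replicas - j) replicas must fit into the other domains.
    The tightest of these bounds is the answer.
    """
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    caps = sorted(capacities_tb, reverse=True)
    return min(sum(caps[j:]) / (replicas - j) for j in range(replicas))


def free_per_domain(capacities_tb: list[float], replicas: int) -> list[float]:
    """Space left in each domain once the allocatable data has been written."""
    data = allocatable_data(capacities_tb, replicas)
    # A domain stores at most one replica of each chunk, so it uses min(c, data).
    return [c - min(c, data) for c in capacities_tb]


racks_tb = [10.0, 20.0, 30.0, 100.0, 100.0]

print(required_domains("2"), required_domains("1+1"))  # 2 2
print(allocatable_data(racks_tb, 3))                   # 60.0
print(free_per_domain(racks_tb, 3))                    # [0.0, 0.0, 0.0, 40.0, 40.0]
```

For the five-rack example, the sketch reproduces the 60 TB figure and shows 40 TB stranded on each of the 100 TB racks. When the number of racks equals the number of replicas, the same formula reduces to the capacity of the smallest rack.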