3.5. Managing Cluster Parameters

This section explains what cluster parameters are and how you can configure them with the vstorage utility.

3.5.1. Cluster Parameters Overview

The cluster parameters control creating, locating, and managing replicas for data chunks in a Virtuozzo Storage cluster. The parameters fall into three main groups: replication, encoding, and location.

The table below briefly describes some of the cluster parameters. For more information on the parameters and how to configure them, see the following sections.

Parameter          Description

Replication Parameters
  Normal Replicas    The number of replicas to create for a data chunk, from 1 to 15. Recommended: 3.
  Minimum Replicas   The minimum number of replicas for a data chunk, from 1 to 15. Recommended: 2.

Location Parameters
  Failure Domain     A placement policy for replicas: room, row, rack, host (default), or disk (CS).
  Tier               Storage tiers, from 0 to 3 (0 by default).

3.5.2. Configuring Replication Parameters

The cluster replication parameters define the following:

  • The normal number of replicas of a data chunk. When a new data chunk is created, Virtuozzo Storage automatically replicates it until the normal number of replicas is reached.
  • The minimum number of replicas of a data chunk (optional). During the life cycle of a data chunk, the number of its replicas may vary. If too many chunk servers go down, the number of replicas may fall below the defined minimum. In that case, all write operations to the affected replicas are suspended until the number of replicas reaches the minimum value again.
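The rule above can be modeled with a short illustrative sketch (not actual Virtuozzo Storage code): a chunk accepts writes only while its live replica count has not dropped below the configured minimum.

```python
def chunk_writable(live_replicas: int, min_replicas: int) -> bool:
    """Illustrative model of the write-suspension rule: writes to a chunk
    are suspended while its live replica count is below the minimum."""
    return live_replicas >= min_replicas

# With normal=3 and minimum=2, losing one replica keeps writes going;
# losing two suspends writes until re-replication restores the minimum.
print(chunk_writable(live_replicas=2, min_replicas=2))  # writes allowed
print(chunk_writable(live_replicas=1, min_replicas=2))  # writes suspended
```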

To check the current replication parameters applied to a cluster, you can use the vstorage get-attr command. For example, if your cluster is mounted to the /vstorage/stor1 directory, you can run the following command:

# vstorage get-attr /vstorage/stor1
connected to MDS#1
File: '/vstorage/stor1'
Attributes:
  ...
  replicas=1:1
  ...

As you can see, the normal and minimum numbers of chunk replicas are set to 1.

Initially, any cluster is configured to have only one replica per data chunk, which is sufficient to evaluate the Virtuozzo Storage functionality using a single server. In production, however, to provide high availability for your data, it is recommended to

  • configure each data chunk to have at least 3 replicas,
  • set the minimum number of replicas to 2.

Such a configuration requires at least 3 chunk servers to be set up in the cluster.

To configure replication parameters so that they apply to all virtual machines and Containers in your cluster, run the vstorage set-attr command on the directory to which the cluster is mounted. For example, to apply the recommended replication values to the stor1 cluster mounted to /vstorage/stor1, set the normal number of replicas for the cluster to 3:

# vstorage set-attr -R /vstorage/stor1 replicas=3

The minimum number of replicas will be automatically set to 2 by default.

Note

For information on how the minimum number of replicas is calculated, see the vstorage-set-attr man page.

Along with applying replication parameters to the entire contents of your cluster, you can also configure them for specific directories and files. For example:

# vstorage set-attr -R /vstorage/stor1/private/MyCT replicas=3

3.5.3. Configuring Encoding Parameters

As a better alternative to replication, Virtuozzo Storage can provide data redundancy by means of erasure coding. With it, Virtuozzo Storage breaks the incoming data stream into fragments of a certain size, then splits each fragment into a certain number (M) of 1-megabyte pieces and creates a certain number (N) of parity pieces for redundancy. All pieces are distributed among M+N storage nodes, that is, one piece per node. On storage nodes, pieces are stored in regular chunks, but such chunks are not replicated, as redundancy is already achieved by the parity pieces. The cluster can survive a failure of any N storage nodes without data loss.

The values of M and N are indicated in the names of erasure coding redundancy modes. For example, in the 5+2 mode, the incoming data is broken into 5MB fragments, each fragment is split into five 1MB pieces and two more 1MB parity pieces are added for redundancy. In addition, if N is 2, the data is encoded using the RAID6 scheme, and if N is greater than 2, Reed-Solomon erasure codes are used.

It is recommended to use the following erasure coding redundancy modes (M+N):

  • 1+0,
  • 3+2,
  • 5+2,
  • 7+2,
  • 17+3.
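To compare the modes, the following sketch (illustrative only, not vstorage code) computes the storage overhead of each recommended mode as (M+N)/M. For example, 5+2 stores 7 MB of raw data per 5 MB of user data, a 1.4x overhead, versus the 3x overhead of 3-way replication.

```python
def encoding_overhead(m: int, n: int) -> float:
    """Raw-to-usable space ratio for an M+N erasure coding mode:
    M data pieces plus N parity pieces are stored per M data pieces."""
    return (m + n) / m

# Recommended modes from the list above; each survives N node failures
# and requires M+N storage nodes.
for m, n in [(1, 0), (3, 2), (5, 2), (7, 2), (17, 3)]:
    print(f"{m}+{n}: overhead x{encoding_overhead(m, n):.2f}, "
          f"survives {n} node failure(s), needs {m + n} nodes")
```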

Encoding is configured for directories. For example:

# vstorage set-attr -R /vstorage/stor1 encoding=5+2

After encoding is enabled, the redundancy mode cannot be changed back to replication. However, you can switch between different encoding modes for the same directory.

3.5.4. Configuring Failure Domains

A failure domain is a set of services that can fail in a correlated manner. Because failures within a domain are correlated, it is critical for data availability to scatter data replicas across different failure domains. Failure domain examples include:

  • A single disk (the smallest possible failure domain). For this reason, Virtuozzo Storage never places more than 1 data replica per disk or chunk server (CS).
  • A single host running multiple CS services. When such a host fails (e.g., due to a power outage or network disconnect), all CS services on it become unavailable at once. For this reason, Virtuozzo Storage is configured by default to make sure that a single host never stores more than 1 chunk replica (see Defining Failure Domains below).
  • Larger-scale multi-rack cluster setups introduce additional points of failure, such as per-rack switches or per-rack power units. In this case, it is important to configure Virtuozzo Storage to store data replicas across such failure domains, so that a massive correlated failure of a single domain does not make data unavailable.

3.5.4.1. Failure Domain Topology

Every Virtuozzo Storage service component has topology information assigned to it. Topology paths define a logical tree of components’ physical locations consisting of 5 identifiers: room.row.rack.host_ID.cs_ID.

[Figure: topology path room.row.rack.host_ID.cs_ID]

The first 3 topology path components, room.row.rack, can be configured in the /etc/vstorage/location configuration files (for more information, see the man page for vstorage-config-files). The last 2 components, host_ID.cs_ID, are generated automatically:

  • host_ID is a unique, randomly generated host identifier created during installation and located at /etc/vstorage/host_id.
  • cs_ID is a unique service identifier generated at CS creation.

Note

To view the current services topology and disk space available per location, run the vstorage top command and press w.
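The five-component topology path described above can be split mechanically. The helper below is a hypothetical sketch, not part of the vstorage tooling, and the sample host identifier is made up for illustration.

```python
def parse_topology(path: str) -> dict:
    """Split a topology path of the form room.row.rack.host_ID.cs_ID
    into its five components (field names are illustrative)."""
    room, row, rack, host_id, cs_id = path.split(".")
    return {"room": room, "row": row, "rack": rack,
            "host_id": host_id, "cs_id": cs_id}

# Hypothetical example: a CS with ID 1025 on a host in rack 2.
print(parse_topology("0.0.2.d1e87f3a.1025"))
```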

3.5.4.2. Defining Failure Domains

Based on the levels of hierarchy described above, you can use the vstorage set-attr command to define failure domains for proper file replica allocation:

# vstorage -c <cluster_name> set-attr -R -p /failure-domain=<disk|host|rack|row|room>

where

  • disk means that only 1 replica is allowed per disk or chunk server,
  • host means that only 1 replica is allowed per host (default),
  • rack means that only 1 replica is allowed per rack,
  • row means that only 1 replica is allowed per row,
  • room means that only 1 replica is allowed per room.

Use the same failure domain configuration for all cluster files, as it simplifies analysis and is less error-prone.

As an example, do the following to configure a 5-rack setup:

  1. Assign 0.0.1 to /etc/vstorage/location on all hosts from the first rack, 0.0.2 on all hosts from the second rack, and so on.

  2. Create 5 metadata servers: 1 on any host from the first rack, 1 on any host from the second rack, and so on.

  3. Configure Virtuozzo Storage to have only 1 replica per rack (assuming that the cluster name is stor1):

    # vstorage -c stor1 set-attr -R -p /failure-domain=rack
    
  4. Create CS services with the vstorage-make-cs command as described in Setting Up Chunk Servers.
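Step 1 above assigns every rack a room.row.rack string. As an illustrative sketch (the helper is hypothetical, not a vstorage tool), the location strings for the 5-rack example can be generated like this:

```python
def rack_location(rack: int, room: int = 0, row: int = 0) -> str:
    """Return the room.row.rack string to write to /etc/vstorage/location
    on every host in the given rack (hypothetical helper)."""
    return f"{room}.{row}.{rack}"

# Location strings for the five racks in the example above.
locations = [rack_location(r) for r in range(1, 6)]
print(locations)
```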

3.5.4.3. Changing Host Location

Once a host has started running services in the cluster, its topology is cached in the MDS and cannot be changed. All new services created on the host will use that cached information. If you need to modify the host location information, do the following:

  1. Kill and remove the CS services running on the host with the vstorage-rm-cs command and unmount the client with the umount command, as described in Removing Chunk Servers.
  2. Set /etc/vstorage/host_id to another unique ID, e.g., generated with /dev/urandom.
  3. Adjust /etc/vstorage/location as required.
  4. Recreate the CS and client services: mount the cluster and create new CS instances with the vstorage-make-cs command, as described in Setting Up Chunk Servers.
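Step 2 above requires a new unique random identifier, similar to reading from /dev/urandom. The sketch below generates one in Python; the exact length and format expected in /etc/vstorage/host_id is an assumption here, so check an existing host_id file on your installation.

```python
import secrets

def new_host_id(num_bytes: int = 8) -> str:
    """Generate a random hexadecimal host identifier, similar to reading
    num_bytes from /dev/urandom. The 8-byte length is an assumption."""
    return secrets.token_hex(num_bytes)

# Generate a candidate ID to write to /etc/vstorage/host_id.
print(new_host_id())
```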

3.5.4.4. Recommendations on Failure Domains

Important

Do not use the disk failure domain together with journaling SSDs. Otherwise, multiple replicas may end up on disks serviced by the same journaling SSD, and if that SSD fails, all replicas that depend on journals located on it will be lost, which may result in data loss.

  • For the flexibility of the Virtuozzo Storage allocator and rebalancing mechanisms, it is recommended to always have at least 5 failure domains configured in a production setup (hosts, racks, etc.). Reserve enough disk space in each failure domain so that if one domain fails, its data can be recovered to the healthy ones.
  • When MDS services are created, the topology and failure domains must be taken into account manually. That is, in multi-rack setups, metadata servers should be created in different racks (5 MDSes in total).
  • At least 3 replicas are recommended for multi-rack setups.
  • Huge failure domains are more sensitive to total disk space imbalance. For example, if a cluster with the rack failure domain has 5 racks with 10 TB, 20 TB, 30 TB, 100 TB, and 100 TB of total disk space, it will not be possible to allocate (10+20+30+100+100)/3 = 86 TB of data in 3 replicas. Instead, only 60 TB will be allocatable, as the low-capacity racks will be exhausted sooner, leaving fewer than 3 domains available for data allocation, while the largest racks (the 100 TB ones) still have free space.
  • If a huge domain fails and goes offline, Virtuozzo Storage does not perform data recovery by default, because replicating a huge amount of data may take longer than repairing the domain. This behavior is managed by the global parameter mds.wd.max_offline_cs_hosts (configured with vstorage-config), which controls the number of failed hosts to be considered a normal disaster worth recovering from in the automatic mode.
  • Failure domains should be similar in terms of I/O performance to avoid imbalance. For example, avoid setups in which failure-domain is set to rack, all racks but one have 10 Nodes each, and one rack has only 1 Node. Virtuozzo Storage will have to repeatedly save a replica to this single Node, reducing overall performance.
  • Depending on the global parameter mds.alloc.strict_failure_domain (configured with vstorage-config), the domain policy can be strict (default) or advisory. Changing this parameter is not recommended unless you are absolutely sure of what you are doing.
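The capacity arithmetic in the imbalance example above can be checked with a small model (illustrative only, not the actual allocator): data of size D in R replicas is placeable if and only if the domains can absorb R*D replica units with no single domain holding more than D, since a domain stores at most one replica of any chunk.

```python
def allocatable(domains, replicas):
    """Largest data size D (in TB) that fits in `replicas` copies when
    each failure domain can hold at most D of the replicas * D total
    replica space. Illustrative model, not the real allocation logic."""
    # D is feasible iff sum(min(cap, D)) >= replicas * D.
    # The slack is concave in D, so binary search finds the maximum.
    lo, hi = 0, sum(domains) // replicas
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if sum(min(c, mid) for c in domains) >= replicas * mid:
            lo = mid
        else:
            hi = mid - 1
    return lo

# Racks of 10, 20, 30, 100, and 100 TB with 3 replicas: only 60 TB fits.
print(allocatable([10, 20, 30, 100, 100], 3))
```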

3.5.5. Using Storage Tiers

This section describes storage tiers used in Virtuozzo Storage clusters and provides information on how to configure and monitor them.

3.5.5.1. What Are Storage Tiers

Storage tiers represent a way to organize storage space. You can use them to keep different categories of data on different chunk servers. For example, you can use high-speed solid-state drives to store performance-critical data instead of caching cluster operations.

3.5.5.2. Configuring Storage Tiers

To assign disk space to a storage tier, do this:

  1. Assign all chunk servers with SSD drives to the same tier. You can do this when setting up a chunk server (see Stage 2: Creating a Chunk Server for details).

    Note

    For information on recommended SSD drives, see Using SSD Drives.

  2. Assign the frequently accessed directories and files to tier 1 with the vstorage set-attr command. For example:

    # vstorage set-attr -R /vstorage/stor1/private/MyCT tier=1
    

    This command recursively assigns the directory /vstorage/stor1/private/MyCT and its contents to tier 1.

When assigning storage to tiers, keep in mind that faster storage drives should be assigned to higher tiers. For example, you can use tier 0 for backups and other cold data (CS without SSD journals), tier 1 for virtual environments with a lot of cold data but fast random writes (CS with SSD journals), and tier 2 for hot data (CS on SSD), journals, caches, specific virtual machine disks, and the like.

This recommendation is related to how Virtuozzo Storage works with storage space. If a storage tier runs out of free space, Virtuozzo Storage will attempt to temporarily use a lower tier. If you add more storage to the original tier later, the data, temporarily stored elsewhere, will be moved to the tier where it should have been stored originally.

For example, if you try to write data to tier 2 and it is full, Virtuozzo Storage will attempt to write that data to tier 1, then to tier 0. If you add more storage to tier 2 later, the aforementioned data, temporarily stored on tier 1 or 0, will be moved back to tier 2, where it was meant to be stored originally.
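The fallback described above can be sketched as follows. This is an illustrative model only; the tier numbers and the free-space mapping are not a real Virtuozzo Storage interface.

```python
def pick_tier(requested, free_by_tier):
    """Model of the write fallback: if the requested tier has no free
    space, try progressively lower tiers; return None if all are full."""
    for tier in range(requested, -1, -1):
        if free_by_tier.get(tier, 0) > 0:
            return tier
    return None

# Tier 2 is full, so a write requested for tier 2 lands on tier 1.
print(pick_tier(2, {2: 0, 1: 5, 0: 9}))
```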

Automatic Data Balancing

To maximize the I/O performance of chunk servers in a cluster, Virtuozzo Storage automatically balances CS load by moving hot data chunks from hot chunk servers to colder ones.

A chunk server is considered hot if its request queue depth exceeds the cluster-average value by 40% or more (see example below). With data chunks, “hot” means “most requested”.

The hotness (i.e. request queue depth) of chunk servers is indicated by the QDEPTH parameter shown in the output of vstorage top and vstorage stat commands. For example:

...
 IO QDEPTH: 0.1 aver, 1.0 max; 1 out of 1 hot CS balanced     46 sec ago
...
 CSID STATUS     SPACE  AVAIL  REPLICAS   UNIQUE IOWAIT IOLAT(ms) QDEPTH HOST      BUILD_VERSION
 1025 active     1007.3 156.8G     7142        0    10%     1/117    0.3 10.31.240.167 6.0.11-10
 1026 active     1007.3 156.8G     7267        0    11%     0/225    0.1 10.31.240.167 6.0.11-10
 1027 active     1007.3 156.8G     7151        0     2%      0/10    0.1 10.31.240.167 6.0.11-10
 1028 active     1007.3 156.8G     7285        0    13%     1/141    1.0 10.31.240.167 6.0.11-10
...

In the output, the IO QDEPTH line shows the average and maximum request queue depth values in the entire cluster for the last 60 seconds. The QDEPTH column shows average request queue depth values for each CS for the last 5 seconds.

Every 60 seconds, the hottest data chunk is moved from a hot CS to one with a shorter request queue.
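The 40% rule can be sketched as follows (an illustrative model, not the actual balancer code). Applied to the QDEPTH column in the sample output above (0.3, 0.1, 0.1, 1.0), the cluster average is 0.375, so the threshold is 0.525 and only CS 1028 counts as hot.

```python
def hot_servers(qdepths, factor=1.4):
    """Return the indices of chunk servers whose request queue depth
    exceeds the cluster average by 40% or more (factor 1.4)."""
    avg = sum(qdepths) / len(qdepths)
    return [i for i, q in enumerate(qdepths) if q >= factor * avg]

# QDEPTH values of CS 1025-1028 from the sample output above.
print(hot_servers([0.3, 0.1, 0.1, 1.0]))  # only the last CS is hot
```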

3.5.5.3. Monitoring Storage Tiers

You can monitor the disk space assigned to each storage tier with the vstorage top command in verbose mode (enabled by pressing v). Typical output may look like this:

[Figure: verbose vstorage top output showing disk space per storage tier]

3.5.6. Changing Virtuozzo Storage Cluster Network

Before moving your cluster to a new network, consider the following:

  • Changing the cluster network results in a brief downtime for the period when more than half of the MDS servers are unavailable.
  • It is highly recommended to back up all MDS repositories before changing the cluster network.

To change the Virtuozzo Storage cluster network, do the following on each node in the cluster where an MDS service is running:

  1. Stop the MDS service:

    # systemctl stop vstorage-mdsd.target
    
  2. Specify new IP addresses for all metadata servers in the cluster with the command vstorage configure-mds -r <MDS_repo> -n <MDS_ID@new_IP_address>[:port] [-n ...], where:

    • <MDS_repo> is the repository path of the MDS on the current node.

    • Each <MDS_ID@new_IP_address> pair is an MDS identifier and a corresponding new IP address. For example, for a cluster with 5 metadata servers:

      # vstorage -c stor1 configure-mds -r /vstorage/stor1-cs1/mds/data -n 1@10.10.20.1 \
      -n 2@10.10.20.2 -n 3@10.10.20.3 -n 4@10.10.20.4 -n 5@10.10.20.5
      

    Note

    1. You can obtain the identifier and repository path for the current MDS with the vstorage list-services -M command.
    2. If you omit the port, the default port 2510 will be used.
  3. Start the MDS service:

    # systemctl start vstorage-mdsd.target
    

3.5.7. Enabling Online Compacting of Virtual Machines

Online compacting of virtual machines on Virtuozzo Storage in the replication mode allows reclaiming disk space no longer occupied by data by means of the FALLOC_FL_PUNCH_HOLE flag. Online compacting is based on triggering the TRIM command from inside the guest. Windows guests have this feature enabled by default, while on Linux guests it is enabled by installing the guest tools.

Note

Online compacting works by default; it stops working if the discard flag of the VM's disk drives is not set to unmap.

To enable online compacting for your Virtuozzo Storage cluster, do the following:

  1. Update all cluster nodes to Virtuozzo 7 Update 5.

  2. Restart updated cluster nodes one by one.

  3. Run the following command on any cluster node:

    # vstorage set-config "gen.do_punch_hole=1"
    

    Warning

    Running the command before updating all the chunk servers will result in data corruption!

Note

To reclaim unused space accumulated before online compacting was enabled (e.g., from VMs created on Virtuozzo 7 Update 4 and older), create a file inside the VM with size comparable to that of the unused space, then remove it.