3.5. Managing Cluster Parameters

This section explains what cluster parameters are and how you can configure them with the vstorage utility.

3.5.1. Cluster Parameters Overview

The cluster parameters control creating, locating, and managing replicas for data chunks in a Virtuozzo Storage cluster. The parameters fall into three main groups: replication, encoding, and location.

The table below briefly describes some of the cluster parameters. For more information on the parameters and how to configure them, see the following sections.

Parameter          Description

Replication Parameters
  Normal Replicas    The number of replicas to create for a data chunk, from 1 to 15. Recommended: 3.
  Minimum Replicas   The minimum number of replicas for a data chunk, from 1 to 15. Recommended: 2.

Location Parameters
  Failure Domain     A placement policy for replicas: room, row, rack, host (default), or disk (CS).
  Tier               Storage tiers, from 0 to 3 (0 by default).

3.5.2. Configuring Replication Parameters

The cluster replication parameters define the following:

  • The normal number of replicas of a data chunk. When a new data chunk is created, Virtuozzo Storage automatically replicates it until the normal number of replicas is reached.
  • The minimum number of replicas of a data chunk (optional). During the life cycle of a data chunk, the number of its replicas may vary. If too many chunk servers go down, the number of replicas may fall below the defined minimum. In that case, all write operations to the affected replicas are suspended until the number of replicas reaches the minimum value again.
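The rule above can be modeled with a short illustrative sketch (not actual Virtuozzo Storage code): a chunk accepts writes only while its live replica count has not dropped below the configured minimum.

```python
def chunk_writable(live_replicas: int, min_replicas: int) -> bool:
    """Illustrative model of the write-suspension rule: writes to a chunk
    are suspended while its live replica count is below the minimum."""
    return live_replicas >= min_replicas

# With normal=3 and minimum=2, losing one replica keeps writes going;
# losing two suspends writes until re-replication restores the minimum.
print(chunk_writable(live_replicas=2, min_replicas=2))  # writes allowed
print(chunk_writable(live_replicas=1, min_replicas=2))  # writes suspended
```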

To check the current replication parameters applied to a cluster, you can use the vstorage get-attr command. For example, if your cluster is mounted to the /vstorage/stor1 directory, you can run the following command:

# vstorage get-attr /vstorage/stor1
connected to MDS#1
File: '/vstorage/stor1'
Attributes:
  ...
  replicas=1:1
  ...

As you can see, the normal and minimum numbers of chunk replicas are set to 1.

Initially, any cluster is configured to have only one replica per data chunk, which is sufficient to evaluate the Virtuozzo Storage functionality using a single server. In production, however, to provide high availability for your data, it is recommended to

  • configure each data chunk to have at least 3 replicas,
  • set the minimum number of replicas to 2.

Such a configuration requires at least 3 chunk servers to be set up in the cluster.

To configure replication parameters so that they apply to all virtual machines and Containers in your cluster, run the vstorage set-attr command on the directory to which the cluster is mounted. For example, to apply the recommended replication values to the stor1 cluster mounted to /vstorage/stor1, set the normal number of replicas for the cluster to 3:

# vstorage set-attr -R /vstorage/stor1 replicas=3

The minimum number of replicas will be automatically set to 2 by default.

Note

For information on how the minimum number of replicas is calculated, see the vstorage-set-attr man page.

Along with applying replication parameters to the entire contents of your cluster, you can also configure them for specific directories and files. For example:

# vstorage set-attr -R /vstorage/stor1/private/MyCT replicas=3

3.5.3. Configuring Encoding Parameters

As a better alternative to replication, Virtuozzo Storage can provide data redundancy by means of erasure coding. With it, Virtuozzo Storage breaks the incoming data stream into fragments of a certain size, then splits each fragment into a certain number (M) of 1-megabyte pieces and creates a certain number (N) of parity pieces for redundancy. All pieces are distributed among M+N storage nodes, that is, one piece per node. On storage nodes, pieces are stored in regular chunks, but such chunks are not replicated, as redundancy is already achieved by the parity pieces. The cluster can survive a failure of any N storage nodes without data loss.

The values of M and N are indicated in the names of erasure coding redundancy modes. For example, in the 5+2 mode, the incoming data is broken into 5MB fragments, each fragment is split into five 1MB pieces and two more 1MB parity pieces are added for redundancy. In addition, if N is 2, the data is encoded using the RAID6 scheme, and if N is greater than 2, Reed-Solomon erasure codes are used.

It is recommended to use the following erasure coding redundancy modes (M+N):

  • 1+0,
  • 3+2,
  • 5+2,
  • 7+2,
  • 17+3.
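To compare the modes, the following sketch (illustrative only, not vstorage code) computes the storage overhead of each recommended mode as (M+N)/M. For example, 5+2 stores 7 MB of raw data per 5 MB of user data, a 1.4x overhead, versus the 3x overhead of 3-way replication.

```python
def encoding_overhead(m: int, n: int) -> float:
    """Raw-to-usable space ratio for an M+N erasure coding mode:
    M data pieces plus N parity pieces are stored per M data pieces."""
    return (m + n) / m

# Recommended modes from the list above; each survives N node failures
# and requires M+N storage nodes.
for m, n in [(1, 0), (3, 2), (5, 2), (7, 2), (17, 3)]:
    print(f"{m}+{n}: overhead x{encoding_overhead(m, n):.2f}, "
          f"survives {n} node failure(s), needs {m + n} nodes")
```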

Encoding is configured for directories. For example:

# vstorage set-attr -R /vstorage/stor1 encoding=5+2

After encoding is enabled, the redundancy mode cannot be changed back to replication. However, you can switch between different encoding modes for the same directory.

3.5.4. Configuring Failure Domains

A failure domain is a set of services that can fail in a correlated manner. Because failures within a domain are correlated, it is critical for data availability to scatter data replicas across different failure domains. Failure domain examples include:

  • A single disk (the smallest possible failure domain). For this reason, Virtuozzo Storage never places more than 1 data replica per disk or chunk server (CS).
  • A single host running multiple CS services. When such a host fails (e.g., due to a power outage or network disconnect), all CS services on it become unavailable at once. For this reason, Virtuozzo Storage is configured by default to make sure that a single host never stores more than 1 chunk replica (see Defining Failure Domains below).
  • Larger-scale multi-rack cluster setups introduce additional points of failure, such as per-rack switches or per-rack power units. In this case, it is important to configure Virtuozzo Storage to store data replicas across such failure domains, so that a massive correlated failure of a single domain does not make data unavailable.

3.5.4.1. Failure Domain Topology

Every Virtuozzo Storage service component has topology information assigned to it. Topology paths define a logical tree of components’ physical locations consisting of 5 identifiers: room.row.rack.host_ID.cs_ID.

[Figure: topology path room.row.rack.host_ID.cs_ID]

The first 3 topology path components, room.row.rack, can be configured in the /etc/vstorage/location configuration files (for more information, see the man page for vstorage-config-files). The last 2 components, host_ID.cs_ID, are generated automatically:

  • host_ID is a unique, randomly generated host identifier created during installation and located at /etc/vstorage/host_id.
  • cs_ID is a unique service identifier generated at CS creation.

Note

To view the current services topology and disk space available per location, run the vstorage top command and press w.
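The five-component topology path described above can be split mechanically. The helper below is a hypothetical sketch, not part of the vstorage tooling, and the sample host identifier is made up for illustration.

```python
def parse_topology(path: str) -> dict:
    """Split a topology path of the form room.row.rack.host_ID.cs_ID
    into its five components (field names are illustrative)."""
    room, row, rack, host_id, cs_id = path.split(".")
    return {"room": room, "row": row, "rack": rack,
            "host_id": host_id, "cs_id": cs_id}

# Hypothetical example: a CS with ID 1025 on a host in rack 2.
print(parse_topology("0.0.2.d1e87f3a.1025"))
```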

3.5.4.2. Defining Failure Domains

Based on the levels of hierarchy described above, you can use the vstorage set-attr command to define failure domains for proper file replica allocation:

# vstorage -c <cluster_name> set-attr -R -p /failure-domain=<disk|host|rack|row|room>

where

  • disk means that only 1 replica is allowed per disk or chunk server,
  • host means that only 1 replica is allowed per host (default),
  • rack means that only 1 replica is allowed per rack,
  • row means that only 1 replica is allowed per row,
  • room means that only 1 replica is allowed per room.

Use the same failure domain configuration for all cluster files, as it simplifies analysis and is less error-prone.

As an example, do the following to configure a 5-rack setup:

  1. Assign 0.0.1 to /etc/vstorage/location on all hosts from the first rack, 0.0.2 on all hosts from the second rack, and so on.

  2. Create 5 metadata servers: 1 on any host from the first rack, 1 on any host from the second rack, and so on.

  3. Configure Virtuozzo Storage to have only 1 replica per rack (assuming that the cluster name is stor1):

    # vstorage -c stor1 set-attr -R -p /failure-domain=rack
    
  4. Create CS services with the vstorage-make-cs command as described in Setting Up Chunk Servers.
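Step 1 above assigns every rack a room.row.rack string. As an illustrative sketch (the helper is hypothetical, not a vstorage tool), the location strings for the 5-rack example can be generated like this:

```python
def rack_location(rack: int, room: int = 0, row: int = 0) -> str:
    """Return the room.row.rack string to write to /etc/vstorage/location
    on every host in the given rack (hypothetical helper)."""
    return f"{room}.{row}.{rack}"

# Location strings for the five racks in the example above.
locations = [rack_location(r) for r in range(1, 6)]
print(locations)
```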

3.5.4.3. Changing Host Location

Once a host has started running services in the cluster, its topology is cached in the MDS and cannot be changed. All new services created on the host will use that cached information. If you need to modify the host location information, do the following:

  1. Kill and remove the CS services running on the host with the vstorage-rm-cs command and unmount the client with the umount command, as described in Removing Chunk Servers.
  2. Set /etc/vstorage/host_id to another unique ID, e.g., generated with /dev/urandom.
  3. Adjust /etc/vstorage/location as required.
  4. Recreate the CS and client services: mount the cluster and create new CS instances with the vstorage-make-cs command, as described in Setting Up Chunk Servers.
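Step 2 above requires a new unique random identifier, similar to reading from /dev/urandom. The sketch below generates one in Python; the exact length and format expected in /etc/vstorage/host_id is an assumption here, so check an existing host_id file on your installation.

```python
import secrets

def new_host_id(num_bytes: int = 8) -> str:
    """Generate a random hexadecimal host identifier, similar to reading
    num_bytes from /dev/urandom. The 8-byte length is an assumption."""
    return secrets.token_hex(num_bytes)

# Generate a candidate ID to write to /etc/vstorage/host_id.
print(new_host_id())
```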

3.5.4.4. Recommendations on Failure Domains

Important

Do not use the disk failure domain together with journaling SSDs. Otherwise, multiple replicas may end up on disks serviced by the same journaling SSD, and if that SSD fails, all replicas that depend on journals located on it will be lost, which may result in data loss.

  • For the flexibility of the Virtuozzo Storage allocator and rebalancing mechanisms, it is recommended to always have at least 5 failure domains configured in a production setup (hosts, racks, etc.). Reserve enough disk space in each failure domain so that if one domain fails, its data can be recovered to the healthy ones.
  • When MDS services are created, the topology and failure domains must be taken into account manually. That is, in multi-rack setups, metadata servers should be created in different racks (5 MDSes in total).
  • At least 3 replicas are recommended for multi-rack setups.
  • Huge failure domains are more sensitive to total disk space imbalance. For example, if a cluster with the rack failure domain has 5 racks with 10 TB, 20 TB, 30 TB, 100 TB, and 100 TB of total disk space, it will not be possible to allocate (10+20+30+100+100)/3 = 86 TB of data in 3 replicas. Instead, only 60 TB will be allocatable, as the low-capacity racks will be exhausted sooner, leaving fewer than 3 domains available for data allocation, while the largest racks (the 100 TB ones) still have free space.
  • If a huge domain fails and goes offline, Virtuozzo Storage does not perform data recovery by default, because replicating a huge amount of data may take longer than repairing the domain. This behavior is managed by the global parameter mds.wd.max_offline_cs_hosts (configured with vstorage-config), which controls the number of failed hosts to be considered a normal disaster worth recovering from in the automatic mode.
  • Failure domains should be similar in terms of I/O performance to avoid imbalance. For example, avoid setups in which failure-domain is set to rack, all racks but one have 10 Nodes each, and one rack has only 1 Node. Virtuozzo Storage will have to repeatedly save a replica to this single Node, reducing overall performance.
  • Depending on the global parameter mds.alloc.strict_failure_domain (configured with vstorage-config), the domain policy can be strict (default) or advisory. Changing this parameter is not recommended unless you are absolutely sure of what you are doing.
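The capacity arithmetic in the imbalance example above can be checked with a small model (illustrative only, not the actual allocator): data of size D in R replicas is placeable if and only if the domains can absorb R*D replica units with no single domain holding more than D, since a domain stores at most one replica of any chunk.

```python
def allocatable(domains, replicas):
    """Largest data size D (in TB) that fits in `replicas` copies when
    each failure domain can hold at most D of the replicas * D total
    replica space. Illustrative model, not the real allocation logic."""
    # D is feasible iff sum(min(cap, D)) >= replicas * D.
    # The slack is concave in D, so binary search finds the maximum.
    lo, hi = 0, sum(domains) // replicas
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if sum(min(c, mid) for c in domains) >= replicas * mid:
            lo = mid
        else:
            hi = mid - 1
    return lo

# Racks of 10, 20, 30, 100, and 100 TB with 3 replicas: only 60 TB fits.
print(allocatable([10, 20, 30, 100, 100], 3))
```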

3.5.5. Using Storage Tiers

This section describes storage tiers used in Virtuozzo Storage clusters and provides information on how to configure and monitor them.

3.5.5.1. What Are Storage Tiers

Storage tiers represent a way to organize storage space. You can use them to keep different categories of data on different chunk servers. For example, you can use high-speed solid-state drives to store performance-critical data instead of caching cluster operations.

3.5.5.2. Configuring Storage Tiers

To assign disk space to a storage tier, do this:

  1. Assign all chunk servers with SSD drives to the same tier. You can do this when setting up a chunk server (see Stage 2: Creating a Chunk Server for details).

    Note

    For information on recommended SSD drives, see Using SSD Drives.

  2. Assign the frequently accessed directories and files to tier 1 with the vstorage set-attr command. For example:

    # vstorage set-attr -R /vstorage/stor1/private/MyCT tier=1
    

    This command recursively assigns the directory /vstorage/stor1/private/MyCT and its contents to tier 1.

When assigning storage to tiers, keep in mind that faster storage drives should be assigned to higher tiers. For example, you can use tier 0 for backups and other cold data (CS without SSD journals), tier 1 for virtual environments with a lot of cold data but fast random writes (CS with SSD journals), and tier 2 for hot data (CS on SSD), journals, caches, specific virtual machine disks, and the like.

This recommendation is related to how Virtuozzo Storage works with storage space. If a storage tier runs out of free space, Virtuozzo Storage will attempt to temporarily use a lower tier. If you add more storage to the original tier later, the data, temporarily stored elsewhere, will be moved to the tier where it should have been stored originally.

For example, if you try to write data to tier 2 and it is full, Virtuozzo Storage will attempt to write that data to tier 1, then to tier 0. If you add more storage to tier 2 later, the aforementioned data, temporarily stored on tier 1 or 0, will be moved back to tier 2, where it was meant to be stored originally.
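The fallback described above can be sketched as follows. This is an illustrative model only; the tier numbers and the free-space mapping are not a real Virtuozzo Storage interface.

```python
def pick_tier(requested, free_by_tier):
    """Model of the write fallback: if the requested tier has no free
    space, try progressively lower tiers; return None if all are full."""
    for tier in range(requested, -1, -1):
        if free_by_tier.get(tier, 0) > 0:
            return tier
    return None

# Tier 2 is full, so a write requested for tier 2 lands on tier 1.
print(pick_tier(2, {2: 0, 1: 5, 0: 9}))
```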

Automatic Data Balancing

To maximize the I/O performance of chunk servers in a cluster, Virtuozzo Storage automatically balances CS load by moving hot data chunks from hot chunk servers to colder ones.

A chunk server is considered hot if its request queue depth exceeds the cluster-average value by 40% or more (see example below). With data chunks, “hot” means “most requested”.

The hotness (i.e. request queue depth) of chunk servers is indicated by the QDEPTH parameter shown in the output of vstorage top and vstorage stat commands. For example:

...
 IO QDEPTH: 0.1 aver, 1.0 max; 1 out of 1 hot CS balanced     46 sec ago
...
 CSID STATUS     SPACE  AVAIL  REPLICAS   UNIQUE IOWAIT IOLAT(ms) QDEPTH HOST      BUILD_VERSION
 1025 active     1007.3 156.8G     7142        0    10%     1/117    0.3 10.31.240.167 6.0.11-10
 1026 active     1007.3 156.8G     7267        0    11%     0/225    0.1 10.31.240.167 6.0.11-10
 1027 active     1007.3 156.8G     7151        0     2%      0/10    0.1 10.31.240.167 6.0.11-10
 1028 active     1007.3 156.8G     7285        0    13%     1/141    1.0 10.31.240.167 6.0.11-10
...

In the output, the IO QDEPTH line shows the average and maximum request queue depth values in the entire cluster for the last 60 seconds. The QDEPTH column shows average request queue depth values for each CS for the last 5 seconds.

Every 60 seconds, the hottest data chunk is moved from a hot CS to one with a shorter request queue.
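The 40% rule can be sketched as follows (an illustrative model, not the actual balancer code). Applied to the QDEPTH column in the sample output above (0.3, 0.1, 0.1, 1.0), the cluster average is 0.375, so the threshold is 0.525 and only CS 1028 counts as hot.

```python
def hot_servers(qdepths, factor=1.4):
    """Return the indices of chunk servers whose request queue depth
    exceeds the cluster average by 40% or more (factor 1.4)."""
    avg = sum(qdepths) / len(qdepths)
    return [i for i, q in enumerate(qdepths) if q >= factor * avg]

# QDEPTH values of CS 1025-1028 from the sample output above.
print(hot_servers([0.3, 0.1, 0.1, 1.0]))  # only the last CS is hot
```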

3.5.5.3. Monitoring Storage Tiers

You can monitor the disk space assigned to each storage tier with the vstorage top command in verbose mode (enabled by pressing v). Typical output may look like this:

[Figure: verbose vstorage top output showing disk space per storage tier]

3.5.6. Changing Virtuozzo Storage Cluster Network

Before moving your cluster to a new network, consider the following:

  • Changing the cluster network results in a brief downtime for the period when more than half of the MDS servers are unavailable.
  • It is highly recommended to back up all MDS repositories before changing the cluster network.

To change the Virtuozzo Storage cluster network, do the following on each node in the cluster where an MDS service is running:

  1. Stop the MDS service:

    # systemctl stop vstorage-mdsd.target
    
  2. Specify new IP addresses for all metadata servers in the cluster with the command vstorage configure-mds -r <MDS_repo> -n <MDS_ID@new_IP_address>[:port] [-n ...], where:

    • <MDS_repo> is the repository path of the MDS on the current node.

    • Each <MDS_ID@new_IP_address> pair is an MDS identifier and a corresponding new IP address. For example, for a cluster with 5 metadata servers:

      # vstorage -c stor1 configure-mds -r /vstorage/stor1-cs1/mds/data -n 1@10.10.20.1 \
      -n 2@10.10.20.2 -n 3@10.10.20.3 -n 4@10.10.20.4 -n 5@10.10.20.5
      

    Note

    1. You can obtain the identifier and repository path for the current MDS with the vstorage list-services -M command.
    2. If you omit the port, the default port 2510 will be used.
  3. Start the MDS service:

    # systemctl start vstorage-mdsd.target
    

3.5.7. Enabling Online Compacting of Virtual Machines

Online compacting of virtual machines on Virtuozzo Storage in the replication mode allows reclaiming disk space no longer occupied by data by means of the FALLOC_FL_PUNCH_HOLE flag. Online compacting is based on triggering the TRIM command from inside the guest. Windows guests have this feature enabled by default, while on Linux guests it is enabled by installing the guest tools.

Note

Online compacting works by default; it stops working if the discard flag of the VM's disk drives is not set to unmap.

To enable online compacting for your Virtuozzo Storage cluster, do the following:

  1. Update all cluster nodes to Virtuozzo 7 Update 5.

  2. Restart updated cluster nodes one by one.

  3. Run the following command on any cluster node:

    # vstorage set-config "gen.do_punch_hole=1"
    

    Warning

    Running the command before updating all the chunk servers will result in data corruption!

Note

To reclaim unused space accumulated before online compacting was enabled (e.g., from VMs created on Virtuozzo 7 Update 4 and older), create a file inside the VM with size comparable to that of the unused space, then remove it.