3.5. Managing Cluster Parameters¶
This section explains what cluster parameters are and how you can configure them with the vstorage utility.
3.5.1. Cluster Parameters Overview¶
The cluster parameters control creating, locating, and managing replicas for data chunks in a Virtuozzo Storage cluster. The parameters fall into three main groups: replication, encoding, and location.
The table below briefly describes some of the cluster parameters. For more information on the parameters and how to configure them, see the following sections.
- Normal Replicas: the number of replicas to create for a data chunk, from 1 to 15. Recommended: 3.
- Minimum Replicas: the minimum number of replicas for a data chunk, from 1 to 15. Recommended: 2.
- Failure Domain: a placement policy for replicas; can be room, row, rack, host (default), or disk (CS).
- Tier: storage tiers, from 0 to 3 (0 by default).
3.5.2. Configuring Replication Parameters¶
The cluster replication parameters define the following:
- The normal number of replicas of a data chunk. When a new data chunk is created, Virtuozzo Storage automatically replicates it until the normal number of replicas is reached.
- The minimum number of replicas of a data chunk (optional). During the life cycle of a data chunk, the number of its replicas may vary. If a lot of chunk servers go down it may fall below the defined minimum. In such a case, all write operations to the affected replicas are suspended until their number reaches the minimum value.
To check the current replication parameters applied to a cluster, you can use the vstorage get-attr command. For example, if your cluster is mounted to the /vstorage/stor1 directory, you can run the following command:
# vstorage get-attr /vstorage/stor1
connected to MDS#1
File: '/vstorage/stor1'
Attributes:
  ...
  replicas=1:1
  ...
As you can see, the normal and minimum numbers of chunk replicas are set to 1.
Initially, any cluster is configured to have only 1 replica per data chunk, which is sufficient to evaluate the Virtuozzo Storage functionality using one server only. In production, however, to provide high availability for your data, you are recommended to:
- configure each data chunk to have at least 3 replicas,
- set the minimum number of replicas to 2.
Such a configuration requires at least 3 chunk servers to be set up in the cluster.
To configure the replication parameters so that they apply to all virtual machines and Containers in your cluster, run the vstorage set-attr command on the directory to which the cluster is mounted. For example, to set the recommended replication values for the stor1 cluster mounted to /vstorage/stor1, set the normal number of replicas for the cluster to 3:
# vstorage set-attr -R /vstorage/stor1 replicas=3
The minimum number of replicas will be automatically set to 2 by default.
For information on how the minimum number of replicas is calculated, see the vstorage-set-attr man page.
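If you prefer to set both values explicitly rather than rely on the computed default, the replicas attribute can also be given as a <normal>:<minimum> pair, mirroring the replicas=1:1 form reported by get-attr. This pair syntax is an assumption here, so check the vstorage-set-attr man page for your version:

```shell
# Hedged example: set 3 normal and 2 minimum replicas in one command.
vstorage set-attr -R /vstorage/stor1 replicas=3:2
```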
Along with applying replication parameters to the entire contents of your cluster, you can also configure them for specific directories and files. For example:
# vstorage set-attr -R /vstorage/stor1/private/MyCT replicas=3
3.5.3. Configuring Encoding Parameters¶
As a better alternative to replication, Virtuozzo Storage can provide data redundancy by means of erasure coding. With it, Virtuozzo Storage breaks the incoming data stream into fragments of a certain size, then splits each fragment into a certain number (M) of 1-megabyte pieces and creates a certain number (N) of parity pieces for redundancy. All pieces are distributed among M+N storage nodes, that is, one piece per node. On storage nodes, pieces are stored in regular chunks, but such chunks are not replicated, as redundancy is already achieved. The cluster can survive the failure of any N storage nodes without data loss.
The values of M and N are indicated in the names of erasure coding redundancy modes. For example, in the 5+2 mode, the incoming data is broken into 5MB fragments, each fragment is split into five 1MB pieces and two more 1MB parity pieces are added for redundancy. In addition, if N is 2, the data is encoded using the RAID6 scheme, and if N is greater than 2, Reed-Solomon erasure codes are used.
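To see why encoding is attractive, compare its raw-space overhead with 3-replica replication for the 5+2 mode described above. The arithmetic below follows directly from the scheme definitions (an illustrative calculation only, not output of any vstorage tool):

```shell
# Overhead of storing 5 MB of user data:
# - replication with 3 replicas stores 15 MB total -> 200% overhead
# - 5+2 encoding stores 5 data + 2 parity pieces = 7 MB -> 40% overhead
echo $(( (3 * 5 - 5) * 100 / 5 ))   # replication overhead, percent
echo $(( (7 - 5) * 100 / 5 ))       # 5+2 encoding overhead, percent
```

Both configurations tolerate the simultaneous loss of any 2 storage nodes.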
It is recommended to use the following erasure coding redundancy modes (M+N):
Encoding is configured for directories. For example:
# vstorage set-attr -R /vstorage/stor1 encoding=5+2
After encoding is enabled, the redundancy mode cannot be changed back to replication. However, you can switch between different encoding modes for the same directory.
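Encoding attributes can also be set and verified per directory, in the same way as replication attributes. A hedged sketch, reusing the cluster mount point and container directory from the earlier examples:

```shell
# Set 5+2 encoding on a specific directory only, then check the result.
vstorage set-attr -R /vstorage/stor1/private/MyCT encoding=5+2
vstorage get-attr /vstorage/stor1/private/MyCT
```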
3.5.4. Configuring Failure Domains¶
A failure domain is a set of services that can fail in a correlated manner. Because of such correlated failures, it is critical to scatter data replicas across different failure domains for data availability. Failure domain examples include:
- A single disk (the smallest possible failure domain). For this reason, Virtuozzo Storage never places more than 1 data replica per disk or chunk server (CS).
- A single host running multiple CS services. When such a host fails (e.g., due to a power outage or network disconnect), all CS services on it become unavailable at once. For this reason, Virtuozzo Storage is configured by default to make sure that a single host never stores more than 1 chunk replica (see Defining Failure Domains below).
- Larger-scale multi-rack cluster setups introduce additional points of failure like per-rack switches or per-rack power units. In this case, it is important to configure Virtuozzo Storage to store data replicas across such failure domains to prevent data unavailability on massive correlated failures of a single domain.
3.5.4.1. Failure Domain Topology¶
Every Virtuozzo Storage service component has topology information assigned to it. Topology paths define a logical tree of components’ physical locations, consisting of 5 identifiers: room.row.rack.host_ID.cs_ID.
The first 3 topology path components, room.row.rack, can be configured in the /etc/vstorage/location configuration file (for more information, see the man page for vstorage-config-files). The last 2 components, host_ID.cs_ID, are generated automatically:
- host_ID is a unique, randomly generated host identifier created during installation and located at /etc/vstorage/host_id.
- cs_ID is a unique service identifier generated at CS creation.
To view the current services topology and the disk space available per location, run the vstorage top command and press w.
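The location file itself holds a plain dot-separated room.row.rack triple. A sketch, where 0.0.1 is a hypothetical placement (room 0, row 0, rack 1); the exact file syntax is documented in the vstorage-config-files man page:

```shell
# Write the host's topology triple before starting any cluster services.
echo "0.0.1" > /etc/vstorage/location
cat /etc/vstorage/location
```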
3.5.4.2. Defining Failure Domains¶
Based on the levels of hierarchy described above, you can use the vstorage set-attr command to define failure domains for proper file replica allocation:
# vstorage -c <cluster_name> set-attr -R -p /failure-domain=<disk|host|rack|row|room>
- disk means that only 1 replica is allowed per disk or chunk server,
- host means that only 1 replica is allowed per host (default),
- rack means that only 1 replica is allowed per rack,
- row means that only 1 replica is allowed per row,
- room means that only 1 replica is allowed per room.
Use the same failure domain configuration for all cluster files: it simplifies analysis and is less error-prone.
As an example, do the following to configure a 5-rack setup:
1. Write 0.0.1 to /etc/vstorage/location on all hosts from the first rack, 0.0.2 on all hosts from the second rack, and so on.
2. Create 5 metadata servers: 1 on any host from the first rack, 1 on any host from the second rack, and so on.
3. Configure Virtuozzo Storage to have only 1 replica per rack (assuming that the cluster name is stor1):
# vstorage -c stor1 set-attr -R -p /failure-domain=rack
4. Create CS services with the vstorage-make-cs command as described in Setting Up Chunk Servers.
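The steps above can be sketched as a small loop. The host names (rackN-hostM) and passwordless root SSH are assumptions for illustration; what matters is the ordering: location files first, then MDSes, then the failure domain, then CSes:

```shell
# Step 1: write the room.row.rack triple on every host, per rack.
for rack in 1 2 3 4 5; do
  for host in "rack${rack}-host1" "rack${rack}-host2"; do   # hypothetical names
    ssh "root@${host}" "echo 0.0.${rack} > /etc/vstorage/location"
  done
done
# Step 2: create one MDS on a host in each rack (5 in total).
# Step 3: restrict allocation to 1 replica per rack:
vstorage -c stor1 set-attr -R -p /failure-domain=rack
# Step 4: create CS services with vstorage-make-cs on each host.
```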
3.5.4.3. Changing Host Location¶
Once a host has started running services in the cluster, its topology is cached in the MDS and cannot be changed. All new services created on the host will use that cached information. If you need to modify the host location information, do the following:
- Kill and remove the CS and client services running on the host, and unmount the cluster with the umount command, as described in Removing Chunk Servers.
- Change the host identifier in /etc/vstorage/host_id to another unique ID, e.g., a newly generated random one.
- Recreate the CS and client services: mount the cluster and create new CS instances with the vstorage-make-cs command as described in Setting Up Chunk Servers.
3.5.4.4. Recommendations on Failure Domains¶
- Do not use the disk failure domain together with journaling SSDs. In this case, multiple replicas may end up on disks serviced by the same journaling SSD. If that SSD fails, all replicas that depend on journals located on it will be lost. As a result, your data may be lost.
- For the flexibility of the Virtuozzo Storage allocator and rebalancing mechanisms, it is recommended to have at least 5 failure domains configured in a production setup (hosts, racks, etc.). Also reserve enough disk space in each failure domain so that, if one domain fails, its data can be recovered to the remaining healthy ones.
- When MDS services are created, the topology and failure domains must be taken into account manually. That is, in multi-rack setups, metadata servers should be created in different racks (5 MDSes in total).
- At least 3 replicas are recommended for multi-rack setups.
- Huge failure domains are more sensitive to total disk space imbalance. For example, if a domain has 5 racks with 10 TB, 20 TB, 30 TB, 100 TB, and 100 TB of total disk space, it will not be possible to allocate (10+20+30+100+100)/3 ≈ 86 TB of data in 3 replicas. Instead, only 60 TB will be allocatable, because the low-capacity racks will be exhausted sooner, after which fewer than 3 domains will remain available for new allocations, even though the largest racks (the 100 TB ones) will still have free space.
- If a huge domain fails and goes offline, Virtuozzo Storage will not perform data recovery by default, because replicating a huge amount of data may take longer than repairing the domain. This behavior is managed by a global parameter (see vstorage-config) that controls the number of failed hosts to be considered a normal disaster worth recovering from in automatic mode.
- Failure domains should be similar in terms of I/O performance to avoid imbalance. For example, avoid setups in which failure-domain is set to rack, all racks but one have 10 nodes each, and one rack has only 1 node. Virtuozzo Storage will have to repeatedly save a replica to this single node, reducing overall performance.
- Depending on another global parameter (see vstorage-config), the domain policy can be strict (default) or advisory. Tuning this parameter is not recommended unless you are absolutely sure of what you are doing.
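The 60 TB figure from the imbalance example above can be derived as follows: with 3 replicas, one per failure domain, allocatable capacity is bounded by the total space divided by 3, by the space outside the largest rack divided by 2, and by the space outside the two largest racks. A sketch (pure arithmetic, no vstorage tooling involved):

```shell
# Rack capacities from the example, in TB, sorted descending: 100 100 30 20 10
total=260; top1=100; top2=100
a=$(( total / 3 ))            # bound from total space: ~86 TB
b=$(( (total - top1) / 2 ))   # bound excluding the largest rack: 80 TB
c=$(( total - top1 - top2 ))  # bound excluding the two largest racks: 60 TB
min=$a
[ "$b" -lt "$min" ] && min=$b
[ "$c" -lt "$min" ] && min=$c
echo "${min} TB allocatable"  # prints "60 TB allocatable"
```

The tightest bound here is the third one: once the small racks are full, no third distinct rack remains for new replicas.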
3.5.5. Using Storage Tiers¶
This section describes the storage tiers used in Virtuozzo Storage clusters and provides information on how to configure and monitor them.
3.5.5.1. What Are Storage Tiers¶
Storage tiers represent a way to organize storage space. You can use them to keep different categories of data on different chunk servers. For example, you can use high-speed solid-state drives to store performance-critical data instead of caching cluster operations.
3.5.5.2. Configuring Storage Tiers¶
To assign disk space to a storage tier, do this:
Assign all chunk servers with SSD drives to the same tier. You can do this when setting up a chunk server (see Stage 2: Creating a Chunk Server for details).
For information on recommended SSD drives, see Using SSD Drives.
Assign the frequently accessed directories and files to tier 1 with the vstorage set-attr command. For example:
# vstorage set-attr -R /vstorage/stor1/private/MyCT tier=1
This command recursively assigns the directory /vstorage/stor1/private/MyCT and its contents to tier 1.
When assigning storage to tiers, keep in mind that faster storage drives should be assigned to higher tiers. For example, you can use tier 0 for backups and other cold data (CS without SSD journals), tier 1 for virtual environments with a lot of cold data but fast random writes (CS with SSD journals), and tier 2 for hot data, journals, caches, and specific virtual machine disks (CS on SSD).
This recommendation is related to how Virtuozzo Storage works with storage space. If a storage tier runs out of free space, Virtuozzo Storage will attempt to temporarily use a lower tier. If you add more storage to the original tier later, the data, temporarily stored elsewhere, will be moved to the tier where it should have been stored originally.
For example, if you try to write data to tier 2 and it is full, Virtuozzo Storage will attempt to write that data to tier 1, then to tier 0. If you add more storage to tier 2 later, the data temporarily stored on tier 1 or 0 will be moved back to tier 2, where it was meant to be stored originally.
Automatic Data Balancing
To maximize the I/O performance of chunk servers in a cluster, Virtuozzo Storage automatically balances CS load by moving hot data chunks from hot chunk servers to colder ones.
A chunk server is considered hot if its request queue depth exceeds the cluster-average value by 40% or more (see example below). With data chunks, “hot” means “most requested”.
The hotness (i.e., the request queue depth) of chunk servers is indicated by the QDEPTH parameter shown in the output of the vstorage top and vstorage stat commands. For example:
...
IO QDEPTH: 0.1 aver, 1.0 max; 1 out of 1 hot CS balanced 46 sec ago
...
CSID  STATUS  SPACE   AVAIL   REPLICAS  UNIQUE  IOWAIT  IOLAT(ms)  QDEPTH  HOST           BUILD_VERSION
1025  active  1007.3  156.8G  7142      0       10%     1/117      0.3     10.31.240.167  6.0.11-10
1026  active  1007.3  156.8G  7267      0       11%     0/225      0.1     10.31.240.167  6.0.11-10
1027  active  1007.3  156.8G  7151      0       2%      0/10       0.1     10.31.240.167  6.0.11-10
1028  active  1007.3  156.8G  7285      0       13%     1/141      1.0     10.31.240.167  6.0.11-10
...
In the output, the IO QDEPTH line shows the average and maximum request queue depth values in the entire cluster for the last 60 seconds. The QDEPTH column shows the average request queue depth value for each CS for the last 5 seconds.
Every 60 seconds, the hottest data chunk is moved from a hot CS to a CS with a shorter request queue.
3.5.5.3. Monitoring Storage Tiers¶
You can monitor the disk space assigned to each storage tier with the vstorage top utility in verbose mode (enabled by pressing v).
3.5.6. Changing Virtuozzo Storage Cluster Network¶
Before moving your cluster to a new network, consider the following:
- Changing the cluster network results in a brief downtime for the period when more than half of the MDS servers are unavailable.
- It is highly recommended to back up all MDS repositories before changing the cluster network.
To change the Virtuozzo Storage cluster network, do the following on each node in the cluster where an MDS service is running:
Stop the MDS service:
# systemctl stop vstorage-mdsd.target
Specify new IP addresses for all metadata servers in the cluster with the command vstorage configure-mds -r <MDS_repo> -n <MDS_ID@new_IP_address>[:port] [-n ...], where:
- <MDS_repo> is the repository path of the MDS on the current node.
- Each <MDS_ID@new_IP_address> pair is an MDS identifier and its corresponding new IP address. For example, for a cluster with 5 metadata servers:
# vstorage -c stor1 configure-mds -r /vstorage/stor1-cs1/mds/data -n 1@<new_IP1> \
-n 2@<new_IP2> -n 3@<new_IP3> -n 4@<new_IP4> -n 5@<new_IP5>
- You can obtain the identifier and repository path for the current MDS with the vstorage list-services -M command.
- If you omit the port, the default port 2510 will be used.
Start the MDS service:
# systemctl start vstorage-mdsd.target
3.5.7. Enabling Online Compacting of Virtual Machines¶
Online compacting of virtual machines on Virtuozzo Storage in the replication mode allows reclaiming disk space no longer occupied by data by means of the FALLOC_FL_PUNCH_HOLE flag. Online compacting is based on triggering the TRIM command from inside the guest. Windows guests have the feature enabled by default, while for Linux guests it is enabled during guest tools installation.
Online compacting works by default as long as the discard flag is set to unmap for the VM’s disk drives.
To enable online compacting for your Virtuozzo Storage cluster, do the following:
Update all cluster nodes to Virtuozzo 7 Update 5.
Restart updated cluster nodes one by one.
Run the following command on any cluster node:
# vstorage set-config "gen.do_punch_hole=1"
Warning: running this command before updating all the chunk servers will result in data corruption!
To reclaim unused space accumulated before online compacting was enabled (e.g., from VMs created on Virtuozzo 7 Update 4 and older), create a file inside the VM with size comparable to that of the unused space, then remove it.
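The balloon-file trick from the last paragraph can be sketched for a Linux guest as follows. The file path and size are assumptions, as is the final fstrim call, which simply forces the TRIM pass that online compacting relies on:

```shell
# Inside the Linux guest: temporarily fill, then free, the unused space.
dd if=/dev/zero of=/balloon.tmp bs=1M count=10240   # ~10 GB; size to your unused space
rm -f /balloon.tmp
fstrim -v /   # trigger TRIM so the host can punch holes in the freed blocks
```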