1.2. Planning Infrastructure for Virtuozzo Storage with GUI Management

To plan your infrastructure for Virtuozzo Storage managed via the web-based management panel, you will need to decide on the hardware configuration of each storage node, plan the storage networks, decide on the redundancy method (and mode) to use, and decide which data will be kept on which storage tier.

Information in this section is meant to help you complete all of these tasks.

1.2.1. Understanding Virtuozzo Storage Architecture

The fundamental component of Virtuozzo Storage is a cluster: a group of physical servers interconnected by network. Each server in a cluster is assigned one or more roles and typically runs services that correspond to these roles:

  • Storage role: chunk service or CS

  • Metadata role: metadata service or MDS

  • Network roles:

    • iSCSI access point service (iSCSI)

    • S3 gateway (access point) service (GW)

    • S3 name service (NS)

    • S3 object service (OS)

    • Web CP

    • SSH

  • Supplementary roles:

    • management

    • SSD cache

    • system

Any server in the cluster can be assigned a combination of storage, metadata, and network roles. For example, a single server can be an S3 access point, an iSCSI access point, and a storage node at once.

Each cluster also requires that a web-based management panel be installed on one (and only one) of the nodes. The panel enables administrators to manage the cluster.

1.2.1.1. Storage Role

Storage nodes run chunk services, store all the data in the form of fixed-size chunks, and provide access to these chunks. All data chunks are replicated and the replicas are kept on different storage nodes to achieve high availability of data. If one of the storage nodes fails, remaining healthy storage nodes continue providing the data chunks that were stored on the failed node.

Only a server with disks of a certain capacity can be assigned the storage role (see Hardware Requirements).

1.2.1.2. Metadata Role

Metadata nodes run metadata services, store cluster metadata, and control how user files are split into chunks and where these chunks are located. Metadata nodes also ensure that chunks have the required amount of replicas and log all important events that happen in the cluster.

To ensure high availability of metadata, at least five metadata services must be running per cluster. In this case, if up to two metadata services fail, the remaining metadata services will still be controlling the cluster.

1.2.1.3. Network Roles (Storage Access Points)

Storage access points enable you to access data stored in Virtuozzo Storage clusters via the standard iSCSI and S3 protocols.

To benefit from high availability, access points should be set up on multiple nodes.

The following access points are currently supported:

  • iSCSI, allows you to use Virtuozzo Storage as a highly available block storage for virtualization, databases, office applications, and other needs.

  • S3, a combination of scalable and highly available services (collectively named Virtuozzo Object Storage) that allows you to use Virtuozzo Storage as a modern backend for solutions like OpenXchange AppSuite and Dovecot. In addition, Virtuozzo Object Storage offers developers of custom applications an Amazon S3-compatible API and compatibility with S3 libraries for various programming languages, S3 browsers, and web browsers.

NFS, SMB, and other access point types are planned for future releases of Virtuozzo Storage.

The following remote management roles are supported:

  • Web CP, allows you to access the web-based user interface from an external network.

  • SSH, allows you to connect to Virtuozzo Storage nodes via SSH.

1.2.1.4. Supplementary Roles

  • Management, provides a web-based management panel that enables administrators to configure, manage, and monitor Virtuozzo Storage clusters. Only one management panel is needed to create and manage multiple clusters (and only one is allowed per cluster).

  • SSD cache, boosts chunk write performance by creating write caches on selected solid-state drives (SSDs). It is recommended to also use these SSDs for metadata (see Metadata Role). The use of a write cache may speed up write operations in the cluster by two or more times.

  • System, one disk per node that is reserved for the operating system and unavailable for data storage.

1.2.2. Planning Node Hardware Configurations

Virtuozzo Storage works on top of commodity hardware, so you can create a cluster from regular servers, disks, and network cards. Still, to achieve the optimal performance, a number of requirements must be met and a number of recommendations should be followed.

1.2.2.1. Hardware Requirements

The following lists the minimal and recommended hardware configuration for a single node in the cluster:

  • CPU

    Minimal: Dual-core CPU

    Recommended: Intel Xeon E5-2620V2 or faster; at least one CPU core per 8 HDDs

  • RAM

    Minimal: 4GB

    Recommended: 16GB ECC or more, plus 0.5GB ECC per HDD

  • Storage

    Minimal:

      • System: approximately 100GB SATA HDD (the exact value depends on the RAM size; see Partitioning the Hard Drives)

      • Metadata: 100GB SATA HDD (on the first five nodes in the cluster)

      • Storage: 100GB SATA HDD

    Recommended:

      • System: approximately 250GB SATA HDD (the exact value depends on the RAM size; see Partitioning the Hard Drives)

      • Metadata+Cache: one or more enterprise-grade SSDs with power loss protection; 100GB or more capacity; 75 MB/s sequential write performance per serviced HDD. For example, a node with 10 HDDs needs an SSD with at least 750 MB/s sequential write speed (on the first five nodes in the cluster)

      • Storage: four or more HDDs or SSDs; 1 DWPD endurance minimum, 10 DWPD recommended

  • Disk controller

    Minimal: None

    Recommended: HBA or RAID

  • Network

    Minimal: 1Gbps or faster network interface

    Recommended: Two 10Gbps network interfaces; dedicated links for internal and public networks

  • Sample configuration (recommended)

    Intel Xeon E5-2620V2, 32GB, 2xST1000NM0033, 32xST6000NM0024, 2xMegaRAID SAS 9271/9201, Intel X540-T2, Intel P3700 800GB
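
For quick planning, the per-HDD rules in the recommended configuration above (at least one CPU core per 8 HDDs, 0.5GB of ECC RAM per HDD on top of the 16GB base, and 75 MB/s of SSD sequential write speed per serviced HDD) can be turned into a small calculator. The following Python sketch is illustrative only; the function and variable names are not part of Virtuozzo Storage:

    import math

    def recommended_node_sizing(hdd_count: int) -> dict:
        """Rough per-node sizing based on the recommendations above."""
        return {
            # At least one CPU core per 8 HDDs.
            "cpu_cores_min": math.ceil(hdd_count / 8),
            # 16GB ECC base plus 0.5GB ECC per HDD.
            "ram_gb": 16 + 0.5 * hdd_count,
            # 75 MB/s of SSD sequential write speed per serviced HDD
            # (metadata + cache SSD on the first five nodes).
            "ssd_seq_write_mbps": 75 * hdd_count,
        }

    # Example: the 10-HDD node mentioned above needs an SSD
    # with at least 750 MB/s sequential write speed.
    print(recommended_node_sizing(10))
    # {'cpu_cores_min': 2, 'ram_gb': 21.0, 'ssd_seq_write_mbps': 750}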

1.2.2.2. Hardware Recommendations

The following recommendations explain the benefits added by specific hardware in the hardware requirements table and are meant to help you configure the cluster hardware in an optimal way:

General hardware recommendations:

  • At least five nodes are required for a production environment. This is to ensure that the cluster can survive failure of two nodes without data loss.

  • One of the strongest features of Virtuozzo Storage is scalability: the bigger the cluster, the better Virtuozzo Storage performs. It is recommended to create production clusters from at least ten nodes for improved resiliency, performance, and fault tolerance.

  • Even though a cluster can be created on top of varied hardware, using nodes with similar hardware will yield better cluster performance, capacity, and overall balance.

  • Any cluster infrastructure must be tested extensively before it is deployed to production. Such common points of failure as SSD drives and network adapter bonds must always be thoroughly verified.

  • It is not recommended for production to run Virtuozzo Storage on top of SAN/NAS hardware that has its own redundancy mechanisms. Doing so may negatively affect performance and data availability.

  • At least 20% of cluster capacity should be free to avoid possible data fragmentation and performance degradation.

  • During disaster recovery, Virtuozzo Storage may need additional disk space for replication. Make sure to reserve at least as much free space as is available on a single storage node.
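
These last two recommendations (keep at least 20% of cluster capacity free, and keep at least one node's worth of space in reserve) can be combined into a rough planning estimate. The following Python sketch only illustrates the arithmetic; the function name and the assumption that the two reserves overlap are not part of Virtuozzo Storage:

    def plan_fillable_capacity(node_capacities_tb):
        """Estimate how much raw space can be filled while honoring the
        recommendations above: keep at least 20% of cluster capacity free
        and reserve at least one node's worth of space for recovery."""
        total = sum(node_capacities_tb)
        free_reserve = 0.20 * total                # keep 20% of capacity free
        rebuild_reserve = max(node_capacities_tb)  # one (largest) node's worth
        # The two reserves are assumed to overlap; a stricter reading
        # would add them instead of taking the larger one.
        return total - max(free_reserve, rebuild_reserve)

    # Example: five nodes with 24TB of raw storage each.
    print(plan_fillable_capacity([24, 24, 24, 24, 24]))  # 96.0 (TB of raw space)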

Storage hardware recommendations:

  • Using the recommended SSD models may help you avoid loss of data. Not all SSD drives can withstand enterprise workloads and may break down in the first months of operation, resulting in TCO spikes.

    • SSD memory cells can withstand a limited number of rewrites. An SSD drive should be viewed as a consumable that you will need to replace after a certain time. Consumer-grade SSD drives can withstand a very low number of rewrites (so low, in fact, that these numbers are not shown in their technical specifications). SSD drives intended for Virtuozzo Storage clusters must offer at least 1 DWPD endurance (10 DWPD is recommended). The higher the endurance, the less often SSDs will need to be replaced, improving TCO.

    • Many consumer-grade SSD drives can ignore disk flushes and falsely report to operating systems that data was written while it in fact was not. Examples of such drives include OCZ Vertex 3, Intel 520, Intel X25-E, and Intel X-25-M G2. These drives are known to be unsafe in terms of data commits, they should not be used with databases, and they may easily corrupt the file system in case of a power failure. For these reasons, use enterprise-grade SSD drives that obey the flush rules (for more information, see http://www.postgresql.org/docs/current/static/wal-reliability.html). Enterprise-grade SSD drives that operate correctly usually have the power loss protection property in their technical specification. Some of the market names for this technology are Enhanced Power Loss Data Protection (Intel), Cache Power Protection (Samsung), Power-Failure Support (Kingston), and Complete Power Fail Protection (OCZ).

    • Consumer-grade SSD drives usually have unstable performance and are not suited to withstand sustainable enterprise workloads. For this reason, pay attention to sustainable load tests when choosing SSDs.

  • The use of SSDs for write caching improves random I/O performance and is highly recommended for all workloads with heavy random access (e.g., iSCSI volumes).

  • Running metadata services on SSDs improves cluster performance. To also minimize CAPEX, the same SSDs can be used for write caching.

  • If capacity is the main goal and you need to store non-frequently accessed data, choose SATA disks over SAS ones. If performance is the main goal, choose SAS disks over SATA ones.

  • The more disks per node the lower the CAPEX. As an example, a cluster created from ten nodes with two disks in each will be less expensive than a cluster created from twenty nodes with one disk in each.

  • Using SATA HDDs with one SSD for caching is more cost effective than using only SAS HDDs without such an SSD.

  • Use HBA controllers as they are less expensive and easier to manage than RAID controllers.

  • Disable all RAID controller caches for SSD drives. Modern SSDs have good performance that can be reduced by a RAID controller’s write and read cache. It is recommended to disable caching for SSD drives and leave it enabled only for HDD drives.

  • If you use RAID controllers, do not create RAID volumes from HDDs intended for storage (you can still do so for system disks). Each storage HDD needs to be recognized by Virtuozzo Storage as a separate device.

  • If you use RAID controllers with caching, equip them with backup battery units (BBUs) to protect against cache loss during power outages.

  • Make sure that the sum of all SSD and HDD speeds does not exceed the speed of the PCI bus to which the HBA/RAID controller is attached.

  • If one of the HDDs is slower than the rest, it will hold back the entire storage cluster. The cluster will work as fast as its slowest drive.

Network hardware recommendations:

  • Use separate networks (and, ideally albeit optionally, separate network adapters) for internal and public traffic. Doing so will prevent public traffic from affecting cluster I/O performance and also prevent possible denial-of-service attacks from the outside.

  • Network latency dramatically reduces cluster performance. Use quality network equipment with low latency links. Do not use consumer-grade network switches.

  • Do not use desktop network adapters like Intel EXPI9301CTBLK or Realtek 8129 as they are not designed for heavy load and may not support full-duplex links. Also use non-blocking Ethernet switches.

  • To avoid intrusions, Virtuozzo Storage should be on a dedicated internal network inaccessible from outside.

  • Use one 1 Gbit/s link per two HDDs on the node (rounded up), as illustrated in the sketch after this list. For one or two HDDs on a node, two bonded network interfaces are still recommended for high network availability. The reason for this recommendation is that 1 Gbit/s Ethernet networks can deliver 110-120 MB/s of throughput, which is close to the sequential I/O performance of a single disk. Since several disks on a server can deliver higher throughput than a single 1 Gbit/s Ethernet link, networking may become a bottleneck.

  • For maximum sequential I/O performance, use one 1Gbit/s link per each hard drive, or one 10Gbit/s link per node. Even though I/O operations are most often random in real-life scenarios, sequential I/O is important in backup scenarios.

  • For maximum overall performance, use one 10 Gbit/s link per node (or two bonded for high network availability).

  • It is not recommended to configure 1 Gbit/s network adapters to use non-default MTUs (e.g., 9000-byte jumbo frames). Such settings require additional configuration of switches and often lead to human error. 10 Gbit/s network adapters, on the other hand, need to be configured to use jumbo frames to achieve full performance.
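
The link-per-HDD rule above can be expressed as a quick calculation. This Python sketch is only illustrative; the function name is made up, and a single 10 Gbit/s link (or two bonded ones) per node remains the recommendation for maximum performance:

    import math

    def recommended_1gbps_links(hdd_count: int) -> int:
        """One 1 Gbit/s link per two HDDs, rounded up, but never fewer than
        two links, since bonded interfaces are recommended for availability."""
        return max(2, math.ceil(hdd_count / 2))

    for hdds in (1, 2, 4, 10):
        print(hdds, "HDDs ->", recommended_1gbps_links(hdds), "x 1 Gbit/s links")
    # 1 -> 2, 2 -> 2, 4 -> 2, 10 -> 5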

1.2.2.3. Hardware and Software Limitations

Hardware limitations:

  • Each physical server must have at least two disks among which the three mandatory roles are assigned: System, Metadata, and Storage. The System role can be combined with the Metadata or Storage role if the system disk capacity is greater than 100GB.

    Note

    1. It is recommended to assign the System+Metadata role to an SSD. Assigning both these roles to an HDD will result in mediocre performance suitable only for cold data (e.g., archiving).

    2. The System role cannot be combined with the Cache and Metadata+Cache roles. The reason is that I/O generated by the operating system and applications would contend with I/O generated by journaling, negating its performance benefits.

  • Five servers are required to test all the features of the product.

  • The system disk must have at least 100GB of space.

  • The maximum supported physical partition size is 254 TiB.

Software limitations:

  • The default storage mount point must not be changed if the GUI is used.

  • The maintenance mode is not supported. Use SSH to shut down or reboot a node.

  • One node can be a part of only one cluster.

  • Only one S3 cluster can be created on top of a storage cluster.

  • Only predefined redundancy modes are available in the management panel.

  • Thin provisioning is always enabled for all data and cannot be configured otherwise.

Note

For network limitations, see Network Limitations.

1.2.2.4. Minimum Configuration

The minimum configuration described in the table will let you evaluate Virtuozzo Storage features:

Node #           | 1st disk role | 2nd disk role   | 3rd and other disk roles | Access points
1                | System        | Metadata        | Storage                  | iSCSI, S3 (private and public)
2                | System        | Metadata        | Storage                  | iSCSI, S3 (private and public)
3                | System        | Metadata        | Storage                  | iSCSI, S3 (private and public)
4                | System        | Metadata        | Storage                  | iSCSI, S3 (private)
5                | System        | Metadata        | Storage                  | iSCSI, S3 (private)
5 nodes in total |               | 5 MDSs in total | 5 or more CSs in total   | Access point services run on five nodes in total

Note

SSD disks can be assigned metadata and cache roles at the same time, freeing up one more disk for the storage role.

Even though five nodes are recommended for the minimum configuration, you can start evaluating Virtuozzo Storage with just one node and add more nodes later. At the very least, a Virtuozzo Storage cluster must have one metadata service and one chunk service running. However, such a configuration will have two key limitations:

  1. Just one MDS will be a single point of failure. If it fails, the entire cluster will stop working.

  2. Just one CS will be able to store just one chunk replica. If it fails, the data will be lost.

1.2.2.6. Raw Disk Space Considerations

When planning the Virtuozzo Storage infrastructure, keep in mind the following to avoid confusion:

  • The capacity of HDDs and SSDs is measured and specified with decimal, not binary prefixes, so “TB” in disk specifications usually means “terabyte”. The operating system, however, displays drive capacity using binary prefixes, meaning that “TB” stands for “tebibyte”, a noticeably larger unit. As a result, disks may show less capacity than marketed by the vendor. For example, a disk with 6TB in its specifications may be shown to have 5.45 TB of actual disk space in Virtuozzo Storage.

  • Virtuozzo Storage reserves 5% of disk space for emergency needs.

Therefore, if you add a 6TB disk to a cluster, the available physical space should increase by about 5.2 TB.
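
The “6TB marketed, about 5.2TB gained” arithmetic above can be reproduced as follows. This Python sketch is purely illustrative; the constant and function names are not part of Virtuozzo Storage:

    TB = 10**12               # decimal terabyte used in disk specifications
    TiB = 2**40               # binary tebibyte used by the operating system
    EMERGENCY_RESERVE = 0.05  # Virtuozzo Storage reserves 5% of disk space

    def space_gained_tib(marketed_tb: float) -> float:
        """Capacity the cluster gains from a disk marketed in decimal TB."""
        binary_tib = marketed_tb * TB / TiB        # ~5.45 for a "6TB" disk
        return binary_tib * (1 - EMERGENCY_RESERVE)

    print(round(space_gained_tib(6), 2))  # ~5.18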

1.2.3. Planning Network

Virtuozzo Storage uses two networks (e.g., Ethernet): a) an internal network that interconnects nodes and combines them into a cluster, and b) a public network for exporting stored data to users.

The figure below shows a top-level overview of the internal and public networks of Virtuozzo Storage. One network interface on each node is also used for management: through it, administrators can access the node from the management panel and via SSH.

../_images/stor_image1.png

1.2.3.1. General Network Requirements

  • Make sure that time is synchronized on all nodes in the cluster via NTP. Doing so will make it easier for the support department to understand cluster logs.

    Note

    In Virtuozzo Hybrid Server 7, time synchronization via NTP is enabled by default using the chronyd service. If you want to use ntpdate or ntpd, stop and disable chronyd first.

1.2.3.2. Network Limitations

  • Nodes are added to clusters by their IP addresses, not FQDNs. Changing the IP address of a node in the cluster will remove that node from the cluster. If you plan to use DHCP in a cluster, make sure that IP addresses are bound to the MAC addresses of nodes’ network interfaces.

  • Fibre channel and InfiniBand networks are not supported.

  • Each node must have Internet access so updates can be installed.

  • MTU is set to 1500 by default.

  • Network time synchronization (NTP) is required for correct statistics.

  • The management role is assigned automatically during installation and cannot be changed in the management panel later.

  • Even though the management node can be accessed from a web browser by the hostname, you still need to specify its IP address, not the hostname, during installation.

1.2.3.3. Per-Node Network Requirements

Network requirements for each cluster node depend on the roles assigned to the node. If a node with multiple network interfaces has multiple roles assigned to it, different interfaces can be assigned to different roles to create dedicated networks for each role. A simple connectivity check for the ports listed below is sketched after this list.

  • Each node in the cluster must have access to the internal network and have the port 8889 open to listen for incoming connections from the internal network.

  • Each storage and metadata node must have at least one network interface for the internal network traffic. The IP addresses assigned to this interface must be either static or, if DHCP is used, mapped to the adapter’s MAC address. The figure below shows a sample network configuration for a storage and metadata node.

    ../_images/stor_image2.png
  • The management node must have a network interface for internal network traffic and a network interface for the public network traffic (e.g., to the datacenter or a public network) so the management panel can be accessed via a web browser.

    The following ports need to be open on a management node by default: 8888 for management panel access from the public network and 8889 for cluster node access from the internal network.

    The figure below shows a sample network configuration for a storage and management node.

    ../_images/stor_image3.png
  • A node that runs one or more storage access point services must have a network interface for the internal network traffic and a network interface for the public network traffic.

    The figure below shows a sample network configuration for a node with an iSCSI access point. iSCSI access points use the TCP port 3260 for incoming connections from the public network.

    ../_images/stor_image4.png

    The next figure shows a sample network configuration for a node with an S3 storage access point. S3 access points use ports 443 (HTTPS) and 80 (HTTP) to listen for incoming connections from the public network.

    ../_images/stor_image5.png

    Note

    In the scenario pictured above, the internal network is used for both the storage and S3 cluster traffic.
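
The ports mentioned above (8889 for internal cluster traffic on every node, 8888 for management panel access, 3260 for iSCSI, and 443 and 80 for S3) can be spot-checked from another machine with a short script. This is a generic TCP connectivity probe, not a Virtuozzo Storage tool; the host addresses and the role-to-port mapping shown here are placeholders for your own environment:

    import socket

    # Ports taken from the requirements above; replace the hosts with your own.
    CHECKS = {
        "cluster node (internal)": ("10.0.0.11", [8889]),
        "management node":         ("203.0.113.10", [8888]),
        "iSCSI access point":      ("203.0.113.20", [3260]),
        "S3 access point":         ("203.0.113.30", [443, 80]),
    }

    def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
        """Return True if a TCP connection to host:port succeeds."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    for role, (host, ports) in CHECKS.items():
        for port in ports:
            state = "open" if port_open(host, port) else "unreachable"
            print(f"{role}: {host}:{port} is {state}")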

1.2.3.4. Network Interface Roles

For a Virtuozzo Storage cluster to function, network interfaces of cluster nodes must be assigned one or more roles described below. Assigning roles automatically configures the necessary firewall rules.

  • Internal. If one or more internal roles are assigned to a network interface, traffic on all ports is allowed to and from said interface.

    • Management. The network interface will be used for communication between the nodes and the management panel. To perform this role, the network interface must be connected to the internal network. This role must be assigned to at least one network interface in the cluster.

    • Storage. The network interface will be used for transferring data chunks between storage nodes. To perform this role, the network interface must be connected to the internal network. This role must be assigned to one network interface on each storage node.

    • Object Storage. The network interface will be used by the S3 storage access point. To perform this role, the network interface must be connected to the internal network. This role must be assigned to one network interface on each node running the S3 storage access point service.

  • Public. If one or more public roles (and no internal roles) are assigned to a network interface, only traffic on ports required by the public role(s) is allowed to and from said interface.

    • iSCSI. The network interface will be used by the iSCSI storage access point to provide access to user data. To perform this role, the network interface must be connected to the public network accessible by iSCSI clients.

    • S3 public. The network interface will be used by the S3 storage access point to provide access to user data. To perform this role, the network interface must be connected to the public network accessible by S3 clients.

    • Web CP. The network interface will be used to transfer web-based user interface data. To perform this role, the network interface must be connected to the public network.

    • SSH. The network interface will be used to manage the node via SSH. To perform this role, the network interface must be connected to the public network.

  • Custom. These roles allow you to open specific ports on public network interfaces.

1.2.3.5. Network Recommendations for Clients

The following table lists the maximum network performance a Virtuozzo Storage client can get with the specified network interface. The recommendation for clients is to use 10Gbps network hardware between any two cluster nodes and minimize network latencies, especially if SSD disks are used.

Storage network interface                         | 1Gbps   | 2 x 1Gbps | 3 x 1Gbps | 10Gbps  | 2 x 10Gbps
Entire node maximum I/O throughput                | 100MB/s | ~175MB/s  | ~250MB/s  | 1GB/s   | 1.75GB/s
Single VM maximum I/O throughput (replication)    | 100MB/s | 100MB/s   | 100MB/s   | 1GB/s   | 1GB/s
Single VM maximum I/O throughput (erasure coding) | 70MB/s  | ~130MB/s  | ~180MB/s  | 700MB/s | 1.3GB/s

1.2.3.6. Sample Network Configuration

The figure below shows an overview of a sample Virtuozzo Storage network.

../_images/stor_image7.png

In this network configuration:

  • The Virtuozzo Storage internal network is a network that interconnects all servers in the cluster. It can be used for the management, storage (internal), and S3 (private) roles. Each of these roles can be moved to a separate dedicated internal network to ensure high performance under heavy workloads.

    This network cannot be accessed from the public network. All servers in the cluster are connected to this network.

    Important

    Virtuozzo Storage does not offer protection from traffic sniffing. Anyone with access to the internal network can capture and analyze the data being transmitted.

  • The Virtuozzo Storage public network is a network over which the storage space is exported. Depending on where the storage space is exported to, it can be an internal datacenter network or an external public network:

    • An internal datacenter network can be used to manage Virtuozzo Storage and export the storage space over iSCSI to other servers in the datacenter, that is, for the management and iSCSI (public) roles.

    • An external public network can be used to export the storage space to the outside services through S3 storage access points, that is, for the S3 public role.

1.2.4. Understanding Data Redundancy

Virtuozzo Storage protects every piece of data by making it redundant. This means that copies of each piece of data are stored across different storage nodes to ensure that the data is available even if some of the storage nodes are inaccessible.

Virtuozzo Storage automatically maintains the required number of copies within the cluster and ensures that all the copies are up-to-date. If a storage node becomes inaccessible, the copies from it are replaced by new ones that are distributed among healthy storage nodes. If a storage node becomes accessible again after downtime, the copies on it which are out-of-date are updated.

The redundancy is achieved by one of two methods: replication or erasure coding (explained in more detail in the next section). The chosen method affects the size of one piece of data and the number of its copies that will be maintained in the cluster.

The general rule is as follows:

  • Choose replication for highly loaded VMs, Windows VMs, and other workloads that generate lots of IOPS.

  • Choose erasure coding for lightly loaded Linux VMs and backups.

Virtuozzo Storage supports a number of modes for each redundancy method. The following table illustrates data overhead of various redundancy modes.

Note

The numbers of nodes listed in the table indicate only the requirements of each redundancy method but not the number of nodes needed for the Virtuozzo Storage cluster. The minimum and recommended cluster configurations are described in Minimum Configuration and Recommended Configuration, respectively.

Redundancy mode              | Nodes required to store data copies | Nodes that can fail without data loss | Storage overhead, % | Raw space required to store 100GB of data
1 replica (no redundancy)    | 1                                   | 0                                     | 0                   | 100GB
2 replicas                   | 2                                   | 1                                     | 100                 | 200GB
3 replicas                   | 3                                   | 2                                     | 200                 | 300GB
Encoding 1+0 (no redundancy) | 1                                   | 0                                     | 0                   | 100GB
Encoding 1+2                 | 3                                   | 2                                     | 200                 | 300GB
Encoding 3+2                 | 5                                   | 2                                     | 67                  | 167GB
Encoding 5+2                 | 7                                   | 2                                     | 40                  | 140GB
Encoding 7+2                 | 9                                   | 2                                     | 29                  | 129GB
Encoding 17+3                | 20                                  | 3                                     | 18                  | 118GB

The 1+0 and 1+2 encoding modes are meant for small clusters that have insufficient nodes for other erasure coding modes but will grow in the future. As the redundancy type cannot be changed once chosen (from replication to erasure coding or vice versa), these modes allow you to choose erasure coding even if your cluster is smaller than recommended. Once the cluster has grown, more beneficial redundancy modes can be chosen.

Note

After upgrading a node in a mixed cluster, you cannot migrate VEs (virtual machines and containers) created in datastores with encoding EC 3+2, 5+2, 7+2, or 17+3 from VHS 7.5 Update 4 to VHS 7.5 Update 3. However, the migration of VEs created in local datastores and datastores with a 3-replica and 2-replica data redundancy mode is available. A mixed cluster is not supported and exists during the upgrade only.

The general recommendation is to always have at least one more node in a cluster than required by the chosen redundancy scheme. For example, a cluster using replication with 3 replicas should have four nodes, and a cluster that works in the 7+2 erasure coding mode should have ten nodes. Such a cluster configuration has the following advantages:

  • The cluster will not be exposed to additional failures when in the degraded state. With one node down, the cluster may not survive even another single-disk failure without data loss.

  • You will be able to perform maintenance on cluster nodes that may be needed to recover a failed node (for example, for installing software updates).

  • In most cases, the cluster will have enough nodes to rebuild itself. In a cluster without a spare node, replicas of user data are distributed across all cluster nodes for redundancy, leaving no spare capacity to rebuild onto. If one or two nodes go down, the user data will not be lost, but the cluster will become degraded and will only start self-healing after the failed nodes are back online. During the rebuilding process, the cluster may be exposed to additional failures until all of its nodes are healthy again.

  • You can replace and upgrade a cluster node without adding a new node to the cluster. A graceful release of a storage node is only possible if the remaining nodes in the cluster can comply with the configured redundancy scheme. You can, however, release a node forcibly without data migration, but it will make the cluster degraded and trigger the cluster self-healing.

You choose a data redundancy mode when configuring storage access points and their volumes. In particular, when creating LUNs for iSCSI storage access points and creating S3 clusters.

No matter what redundancy mode you choose, it is highly recommended to be protected against a simultaneous failure of two nodes, as that happens often in real-life scenarios.

Note

All redundancy modes allow write operations when one storage node is inaccessible. If two storage nodes are inaccessible, write operations may be frozen until the cluster heals itself.

1.2.4.1. Redundancy by Replication

With replication, Virtuozzo Storage breaks the incoming data stream into 256MB chunks. Each chunk is replicated and replicas are stored on different storage nodes, so that each node has only one replica of a given chunk.

The following diagram illustrates the 2 replicas redundancy mode.

../_images/stor_image8.png

Replication in Virtuozzo Storage is similar to the RAID rebuild process but has two key differences:

  • Replication in Virtuozzo Storage is much faster than that of a typical online RAID 1/5/10 rebuild. The reason is that Virtuozzo Storage replicates chunks in parallel, to multiple storage nodes.

  • The more storage nodes are in a cluster, the faster the cluster will recover from a disk or node failure.

High replication performance minimizes the periods of reduced redundancy for the cluster. Replication performance is affected by:

  • The number of available storage nodes. As replication runs in parallel, the more available replication sources and destinations there are, the faster it is.

  • Performance of storage node disks.

  • Network performance. All replicas are transferred between storage nodes over network. For example, 1 Gbps throughput can be a bottleneck (see Per-Node Network Requirements).

  • Distribution of data in the cluster. Some storage nodes may have much more data to replicate than others and may become overloaded during replication.

  • I/O activity in the cluster during replication.

1.2.4.2. Redundancy by Erasure Coding

With erasure coding, Virtuozzo Storage breaks the incoming data stream into fragments of certain size, then splits each fragment into a certain number (M) of 1-megabyte pieces and creates a certain number (N) of parity pieces for redundancy. All pieces are distributed among M+N storage nodes, that is, one piece per node. On storage nodes, pieces are stored in regular chunks of 256MB but such chunks are not replicated as redundancy is already achieved. The cluster can survive failure of any N storage nodes without data loss.

The values of M and N are indicated in the names of erasure coding redundancy modes. For example, in the 5+2 mode, the incoming data is broken into 5MB fragments, each fragment is split into five 1MB pieces and two more 1MB parity pieces are added for redundancy. In addition, if N is 2, the data is encoded using the RAID6 scheme, and if N is greater than 2, erasure codes are used.

The diagram below illustrates the 5+2 mode.

../_images/stor_image9.png
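
As a cross-check of the redundancy table in Understanding Data Redundancy, the storage overhead and raw space figures follow directly from M and N: M+N nodes store the pieces, any N of them can fail, and raw usage grows by the factor (M+N)/M. The Python sketch below is illustrative arithmetic only; the function name is made up:

    def encoding_overhead(m: int, n: int, data_gb: float = 100.0) -> dict:
        """Overhead of the M+N erasure coding mode for a given amount of data."""
        return {
            "nodes_required": m + n,
            "nodes_that_can_fail": n,
            "overhead_percent": round(100 * n / m),
            "raw_space_gb": round(data_gb * (m + n) / m),
        }

    for m, n in ((3, 2), (5, 2), (7, 2), (17, 3)):
        print(f"{m}+{n}:", encoding_overhead(m, n))
    # Matches the table: 3+2 -> 67%, 167GB; 5+2 -> 40%, 140GB;
    # 7+2 -> 29%, 129GB; 17+3 -> 18%, 118GB.

Replication follows the same arithmetic with M equal to 1, which is why 3 replicas and encoding 1+2 have identical overhead.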

1.2.4.3. No Redundancy

Warning

Danger of data loss!

Without redundancy, single chunks are stored on storage nodes, one per node. If a node fails, the data stored on it may be lost. Having no redundancy is strongly discouraged in any scenario, unless you only want to evaluate Virtuozzo Storage on a single server.

1.2.5. Understanding Failure Domains

A failure domain is a set of services which can fail in a correlated manner. To provide high availability of data, Virtuozzo Storage spreads data replicas evenly across failure domains, according to a replica placement policy.

The following policies are available:

  • Host as a failure domain (default). If a single host running multiple CS services fails (e.g., due to a power outage or network disconnect), all CS services on it become unavailable at once. To protect against data loss under this policy, Virtuozzo Storage never places more than one data replica per host. This policy is highly recommended for clusters of five or more nodes.

  • Disk, the smallest possible failure domain. Under this policy, Virtuozzo Storage never places more than one data replica per disk or CS. While protecting against disk failure, this option may still result in data loss if data replicas happen to be on different disks of the same host and it fails. This policy can be used with small clusters of up to five nodes (down to a single node).

1.2.6. Understanding Storage Tiers

In Virtuozzo Storage terminology, tiers are disk groups that allow you to organize storage workloads based on your criteria. For example, you can use tiers to separate workloads produced by different tenants. Or you can have a tier of fast SSDs for service or virtual environment workloads and a tier of high-capacity HDDs for backup storage.

When assigning disks to tiers (which you can do at any time), keep in mind that faster storage drives should be assigned to higher tiers. For example, you can use tier 0 for backups and other cold data (CS without SSD cache), tier 1 for virtual environments with a lot of cold data but fast random writes (CS with SSD cache), and tier 2 for hot data (CS on SSD), caches, specific disks, and so on.

This recommendation is related to how Virtuozzo Storage works with storage space. If a storage tier runs out of free space, Virtuozzo Storage will attempt to temporarily use lower tiers, down to the lowest. If the lowest tier also becomes full, Virtuozzo Storage will attempt to use a higher one. If you add more storage to the original tier later, the data temporarily stored elsewhere will be moved to the tier where it should have been stored originally. For example, if you try to write data to tier 2 and it is full, Virtuozzo Storage will attempt to write that data to tier 1, then to tier 0. If you add more storage to tier 2 later, the aforementioned data, now stored on tier 1 or 0, will be moved back to tier 2, where it was meant to be stored originally.
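
The fallback order just described (the requested tier first, then lower tiers down to the lowest, then higher tiers) can be summarized with a toy function. This is only an illustration of the documented behavior, under the assumption of four tiers numbered 0-3; it is not Virtuozzo Storage code:

    def tier_fallback_order(requested: int, tiers=(0, 1, 2, 3)) -> list:
        """Order in which tiers are tried when the requested tier is full."""
        lower = sorted((t for t in tiers if t < requested), reverse=True)
        higher = sorted(t for t in tiers if t > requested)
        return [requested] + lower + higher

    print(tier_fallback_order(2))  # [2, 1, 0, 3]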

Inter-tier data allocation as well as the transfer of data to the original tier occurs in the background. You can disable such migration and keep tiers strict as described in Disabling Inter-Tier Data Allocation.

Note

With the exception of out-of-space situations, automatic migration of data between tiers is not supported.