Cache configuration

Supported device types

Currently supported drives include HDD, SSD, and NVMe devices. Their characteristics are described in the table below.

Type                                Cost      Performance                                        Interface and form-factor
Hard disk drives (HDD)              Low       Up to 200 MB/s, tens to hundreds of IOPS           SAS or SATA
Solid-state drives (SSD)            Average   Up to 600 MB/s, tens of thousands of IOPS          SAS or SATA
Non-volatile memory express (NVMe)  High      From 1 to 10 GB/s, hundreds of thousands of IOPS   2.5" U.2, PCIe Add-In-Card (AIC), or M.2

PMem or NVRAM devices are not officially supported.

Check the number and type of cache devices supported on each cluster node. To provide a performance benefit, a cache device must be faster than the underlying capacity devices in terms of throughput, latency, or IOPS. In practice, this generally means using SSD or NVMe devices to cache HDDs, and NVMe devices to cache SSDs.

Cache devices configured in RAID1 mirroring are not officially supported.

It is recommended that all capacity devices in the same storage tier be identical in technology and size. Otherwise, performance and behavior in case of a hardware failure may be unpredictable. Moreover, all cluster nodes should offer the same amount of storage. If this requirement is not met, the storage space in the cluster will be limited by the smallest node.

A similar recommendation applies to cache devices. As the write speed is constrained by the slowest device in the cluster, we strongly recommend using cache devices of the same technology and size.

Choosing a cache device

As all the data ingested in the system goes through cache devices, the choice of a cache device should be based not only on speed, but also on device endurance. Device endurance is measured in two ways:

  • Drive Writes per Day (DWPD) measures how many times the device's full capacity can be overwritten each day throughout its warranty period (usually five years).
  • Terabytes Written (TBW) measures the total amount of data that can be written to the device before it is expected to fail.

Both parameters express the same endurance rating and should be carefully evaluated. For example, if you have a 1-TB flash drive rated at 1 DWPD, you can write 1 TB to it every day over its lifetime. If its warranty period is five years, that works out to 1 TB/day * 365 days/year * 5 years = 1825 TB of cumulative writes, after which the drive will usually have to be replaced. Thus, the drive’s TBW rating is 1825 TB.

The DWPD of a typical consumer-grade SSD drive can be as low as 0.1, while a high-end datacenter-grade flash drive can have up to 60 DWPD. For a cache device, the recommended minimum is 10 DWPD.
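
To compare drives whose endurance is quoted in different units, the two ratings can be converted into each other. The following is a minimal Python sketch of that conversion; the capacities, ratings, and the five-year warranty period used in the example calls are illustrative assumptions, not vendor data.

    # Convert between DWPD and TBW endurance ratings.
    # Capacity, rating, and warranty figures below are illustrative assumptions.

    def tbw_from_dwpd(capacity_tb: float, dwpd: float, warranty_years: float = 5.0) -> float:
        """Cumulative terabytes that can be written over the warranty period."""
        return capacity_tb * dwpd * 365 * warranty_years

    def dwpd_from_tbw(capacity_tb: float, tbw: float, warranty_years: float = 5.0) -> float:
        """Drive writes per day implied by a TBW rating."""
        return tbw / (capacity_tb * 365 * warranty_years)

    # The 1-TB drive rated at 1 DWPD from the example above:
    print(tbw_from_dwpd(1, 1))        # 1825.0 TB
    # A hypothetical 1.6-TB drive rated at 8760 TBW:
    print(dwpd_from_tbw(1.6, 8760))   # 3.0 DWPD, below the recommended 10 DWPD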

Another parameter to consider is whether the device has power loss protection. Some consumer-grade flash drives are known to silently ignore data flushing requests, which may lead to data loss in case of a power outage. Examples of such drives include OCZ Vertex 3, Intel 520, Intel X25-E, and Intel X25-M G2. We recommend avoiding these drives (or testing them with the vstorage-hwflush-check tool) and using enterprise-grade or datacenter-grade devices instead. For more information on checking power loss protection, refer to Checking disk data flushing capabilities.

Provisioning cache devices

The minimum number of cache devices per node is one. However, note that in this case, if caching is used for all capacity devices, the cache device becomes a single point of failure, which may make the entire node unavailable. In order to avoid this, at least three cache devices per node are recommended.

Using multiple cache devices also provides the following improvements:

  • More capacity. This can be helpful if data is written in long bursts or if the cache cannot offload data to the underlying devices fast enough.
  • Performance boost. If there is enough parallelism on the client side, the workload can be split among several cache devices, thus increasing the overall throughput.
  • High availability. With fewer capacity devices per cache device or with RAID mirroring, you can reduce the probability of downtime or lessen its impact.

It is generally recommended to provision one cache device for every 4-12 capacity devices. Keep in mind that the speed of a cache device should be at least twice as high as that of the underlying capacity devices combined; otherwise, the cache device may become a performance bottleneck. Even in that case, however, using a cache can still improve latency, and even overall performance in systems with low parallelism.

To calculate the optimal number of cache devices for your cluster, consider the following formula:

N = 0.8 * (cache_speed / capacity_speed)

Where:

  • N is the maximum number of capacity devices for each cache device.
  • cache_speed is the sustained write speed of a cache device.
  • capacity_speed is the sustained write speed of a capacity device.

The amount of available RAM must also be taken into account, as explained below.

For more accurate results, the device speed should be determined experimentally with real workloads. For evaluation purposes, you may use the sustained speed of sequential or random workloads, depending on the type of workload you expect.

To avoid performance degradation, the resulting number of capacity devices per cache device (N) should be treated as an upper bound. In some cases, explained below, this number should be at least halved for caching to give any performance benefit.
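
For illustration, the formula and the halving rule from the next section can be expressed as a small Python helper. This is only a sketch; the speed figures in the example call are assumptions and should be replaced with values measured on your own hardware.

    # Estimate the maximum number of capacity devices per cache device.
    # Speeds are sustained write speeds (for example, in MB/s) measured under
    # a workload similar to the production one.

    def max_capacity_devices(cache_speed: float, capacity_speed: float,
                             journal_as_read_write_cache: bool = False) -> int:
        """Upper bound N = 0.8 * (cache_speed / capacity_speed).

        If the journal also serves as a read/write cache (see "Journal sizing"
        below), the bound is halved.
        """
        n = 0.8 * (cache_speed / capacity_speed)
        if journal_as_read_write_cache:
            n /= 2
        return int(n)

    # Example: an NVMe cache device at 2000 MB/s in front of HDDs at 180 MB/s.
    print(max_capacity_devices(2000, 180))        # 8
    print(max_capacity_devices(2000, 180, True))  # 4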

Journal sizing

Regardless of the cache device size, its journal size can vary, depending on the available space and the number of chunk services that share the cache device. In some scenarios, using a journal smaller than the available capacity leads to performance improvements.

On one hand, if the total size of all journals is less than the amount of available RAM, the journal is only used to write temporary data and guarantee consistency. Its small size allows the system to keep the journal in RAM, avoiding all reads from the journal and resulting in fewer I/O operations. Ultimately, this reduces the load on the device and in some cases may improve the overall performance (for example, when the cache and capacity devices have the same or similar performance).

On the other hand, if the total size of all journals exceeds the amount of available RAM, the journal also serves as a read and write cache. This boosts the performance of both read and write requests, but to achieve such benefits, the cache device must be at least twice as fast as all of the underlying capacity devices combined, and the number of capacity devices per cache device must be at most half the N value from the formula above. If this is not the case, it is preferable to have a smaller journal. Because the achievable speed also depends heavily on the workload, the right choice may not be obvious.
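
To make the RAM comparison concrete, the following hedged sketch checks whether the journals on a node fit into the RAM available for them and reports which behavior to expect; the journal and RAM sizes in the example call are assumptions for illustration.

    # Decide which journal behavior to expect on a node, based on whether the
    # combined journal size fits into the RAM available for caching.
    # All sizes are in GB; the example values are illustrative.

    def journal_mode(journal_sizes_gb: list, available_ram_gb: float) -> str:
        total = sum(journal_sizes_gb)
        if total <= available_ram_gb:
            return "journals fit in RAM: write journal only, no journal reads from disk"
        return ("journals exceed RAM: journal also acts as a read/write cache; "
                "the cache device must be at least 2x faster than all capacity "
                "devices combined, and N should be halved")

    # Example: three 64-GB journals on a node with 256 GB of RAM available.
    print(journal_mode([64, 64, 64], 256))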

Cache sizing

To decide on a cache device size, consider the endurance factor of a particular device and its journal size.

If you use cache for user data, then the cache device should be able to withstand sustained high throughput for as long as needed without filling up. The cache must offload its contents to the underlying device periodically, and this process depends on the speed of the underlying device. If the cache device becomes full, the system performance will degrade to the speed of the underlying devices, thus negating the caching benefits. Therefore, if the expected workload comes in bursts of a certain duration (for example, during office hours), the cache should be able to store at least the amount of data written during that period of time.

Note that 0.1% of the size of the capacity devices combined is reserved for checksums on the cache device. For example, with the recommended number of up to 12 capacity devices per cache device and HDDs of 20 TB each, the reserved space may be calculated as 0.1% * 20 TB * 12 = 240 GB. A small margin of about 5% of the device size is always kept free. All other space of the cache device is reserved for journals (read/write cache).
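
The space accounting above can be summarized in a short Python sketch. The 1.6-TB cache device size in the example call is an assumption for illustration; the capacity figures follow the example in the text.

    # Estimate the space left for journals (read/write cache) on one cache
    # device: 0.1% of the combined capacity is reserved for checksums and
    # about 5% of the cache device is kept free. All sizes are in GB.

    def journal_space_gb(cache_device_gb: float, capacity_device_gb: float,
                         capacity_devices: int) -> float:
        checksum_reserve = 0.001 * capacity_device_gb * capacity_devices
        free_margin = 0.05 * cache_device_gb
        return cache_device_gb - checksum_reserve - free_margin

    # Example from the text: 12 x 20-TB HDDs behind a 1.6-TB cache device.
    # Checksums reserve 0.1% * 20000 GB * 12 = 240 GB.
    print(journal_space_gb(1600, 20000, 12))  # 1280.0 GB left for journals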

Risks and possible failures

Though cache devices may significantly improve the cluster performance, you need to consider their possible failures. Flash devices generally have a shorter lifespan than capacity devices, and their use as cache exposes them to greater wear.

Also, keep in mind that as one cache device can be used to store multiple journals, all capacity devices associated with a cache device will become unavailable if this cache device fails.

Consider the following possible issues when using cache devices:

  • Data loss. A cache device failure may lead to data loss if the data has no replicas or RAID mirroring is not configured.
  • Performance degradation. If a cache device fails, the system will use other devices for storing data, which may result in a performance bottleneck or trigger the data rebalancing process to restore the data redundancy. This, in turn, will lead to increased disk and network usage and reduce the cluster performance.
  • Low availability. With a failed cache device, data redundancy may be degraded, which may result in a read-only or unreadable cluster in severe cases.
  • Less capacity. If a cache device fails, several capacity devices may become unavailable, leading to a lack of disk space available for writing new data.

To prevent these issues, use optimal redundancy policies and multiple cache devices in your system. Additionally, consider using local replication (for example, RAID1) on top of distributed replication, especially in systems with low replication factors (1 replica or 1+0 encoding).