8.2. Appendix B - Frequently Asked Questions

This appendix lists the most frequently asked questions about Storage clusters.

8.2.1. General

Can the /pstorage directory still be used on newer installations?

Yes. In newer installations, /pstorage remains as a symlink to the new /vstorage directory for compatibility purposes.
Do I need to buy additional storage hardware for Storage?

No. Storage eliminates the need for external storage devices typically used in SANs by converting locally attached storage from multiple nodes into a shared storage.
What are the hardware requirements for Storage?

Storage does not require any special hardware and can run on commodity computers with traditional SATA drives and 1 GbE networks. However, some hard drives and RAID controllers ignore the FLUSH command to imitate better performance. Such devices must not be used in clusters, as this may lead to file system or journal corruption. This is especially true for RAID controllers and SSD drives. Consult your hard drive’s manual to make sure you use reliable hardware.

For more information, see Hardware Requirements.
How many servers do I need to run a Storage cluster?

You need only one physical server to start using Storage. However, to provide high availability for your data, it is recommended to keep at least 3 replicas of each data chunk. For this, you will need at least three nodes (and at least five in total) in the cluster. For details, see Storage Configurations and Configuring Replication Parameters.
Can I join Hardware Nodes running different supported operating systems into a single Storage cluster?

Yes. You can create Storage clusters of Hardware Nodes running any combination of supported operating systems. For example, you can have metadata servers on Hardware Nodes with Ubuntu 14.04, chunk servers on Hardware Nodes with Red Hat Enterprise Linux 7, and clients on computers with CentOS 7.

Note

The current standalone version of Storage does not support Virtuozzo.

8.2.2. Scalability and Performance

How many servers can I join to a Storage cluster?

There is no strict limit on the number of servers you can include in a Storage cluster. However, you are recommended to limit the servers in the cluster to a single rack to avoid any possible performance degradation due to inter-rack communications.
How much disk space can a Storage cluster have?

A Storage cluster can support up to 8 PB of effective available disk space, which amounts to 24 PB of physical disk space when 3 replicas are kept for each data chunk.
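
The relationship between effective and physical space above is plain multiplication by the replica count. A minimal sketch, assuming every chunk is kept with the same number of replicas (the function name is illustrative and is not part of any Storage tool):

```python
def physical_capacity_pb(effective_pb, replicas):
    """Raw disk space (PB) needed to hold `effective_pb` of user data
    when every chunk is stored `replicas` times."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return effective_pb * replicas


print(physical_capacity_pb(8, 3))  # 24
```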
Can I add nodes to an existing Storage cluster?

Yes, you can dynamically add and remove nodes from a Storage cluster to increase its capacity or to take nodes offline for maintenance. For more information, see Configuring Chunk Servers.
What is the expected performance of a Storage cluster?

The performance depends on the network speed and the hard disks used in the cluster. In general, the performance should be similar to locally attached storage or better. You can also use SSD caching to increase the performance of commodity hardware by adding SSD drives to the cluster for caching purposes. For more information, see Using SSD Drives.
What performance should I expect on 1-gigabit Ethernet?

The maximum speed of a 1 GbE network is close to that of a single rotational drive. In most workloads, however, random I/O access is prevalent and the network is usually not a bottleneck. Research with large service providers has shown that average I/O performance rarely exceeds 20 MB/sec due to randomization. Virtualization itself introduces additional randomization as multiple independent environments perform I/O access simultaneously. Nevertheless, 10-gigabit Ethernet will often result in better performance and is recommended for use in production.
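
To put these figures side by side, the sketch below converts link speed to a theoretical throughput ceiling. It assumes decimal units (1 Gbit/s = 1000 Mbit/s) and ignores protocol overhead, so real throughput will be somewhat lower. The helper name is illustrative.

```python
def link_throughput_mb_per_s(gigabits_per_s):
    """Theoretical ceiling in MB/s for a link, ignoring protocol overhead."""
    return gigabits_per_s * 1000 / 8


one_gbe = link_throughput_mb_per_s(1)
ten_gbe = link_throughput_mb_per_s(10)
typical_random_io = 20  # MB/s, the average quoted above

print(one_gbe)                      # 125.0
print(ten_gbe)                      # 1250.0
print(typical_random_io / one_gbe)  # 0.16
```

Under these assumptions, a 20 MB/sec random workload uses about 16% of a 1 GbE link, which is consistent with the network usually not being the bottleneck.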
Will the overall cluster performance improve if I add new chunk servers to the cluster?

Yes. Since data is distributed among all hard drives in the cluster, applications performing random I/O experience an increase in IOPS when more drives are added to the cluster. Even a single client machine may get noticeable benefits by increasing the number of chunk servers and achieve performance far beyond traditional, locally attached storage.
Does performance depend on the number of chunk replicas?

Each additional replica degrades write performance by about 10%, but at the same time it may also improve read performance because the Storage cluster has more options to select a faster server.

8.2.3. Availability

How does Storage protect my data?

Storage protects against data loss and temporary unavailability by creating data copies (replicas) and storing them on different servers. To provide additional reliability, you can configure Storage to maintain user data checksums and verify them when necessary.
What happens when a disk is lost or a server becomes unavailable?

Storage automatically recovers from a degraded state to the specified redundancy level by replicating data on live servers. Users can still access their data during the recovery process.
How fast does Storage recover from a degraded state?

Since Storage recovers from a degraded state using all the available hard disks in the cluster, the recovery process is much faster than for traditional, locally attached RAIDs. This makes the reliability of the storage system significantly better as the probability of losing the only remaining copy of data during the recovery period is very small.
Can I change redundancy settings on the fly?

Yes, at any point you can change the number of data copies, and Storage will apply the new settings by creating new copies or removing unneeded ones. For more details on configuring replication parameters, see Configuring Replication Parameters.
Do I still need to use local RAIDs?

No. Storage provides the same built-in data redundancy as a mirror RAID1 array with multiple copies. However, for better sequential performance, you can use a local striping RAID0 exported to your Storage cluster. For more information on using RAIDs, see Exploring Possible Disk Drive Configurations.
Does Storage have redundancy levels similar to RAID5?

No. To build a reliable software-based RAID5 system, you also need to use special hardware capabilities like backup power batteries. In the future, Storage may be enhanced to provide RAID5-level redundancy for read-only data such as backups.
What is the recommended number of data copies?

It is recommended to configure Storage to maintain 2 or 3 copies, which allows your cluster to survive the simultaneous loss of 1 or 2 hard drives.
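
The rule of thumb behind this recommendation can be written down directly. This is a sketch that assumes each copy of a chunk lives on a different drive, so data survives as long as one copy remains. The function name is illustrative.

```python
def tolerated_drive_losses(copies):
    """Number of simultaneous drive losses that still leave one copy intact."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return copies - 1


for copies in (2, 3):
    print(copies, "copies tolerate", tolerated_drive_losses(copies), "lost drive(s)")
```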

8.2.4. Cluster Operation

How do I know that the new replication parameters have been successfully applied to the cluster?

To check whether the replication process is complete, run the vstorage top command, press the V key on your keyboard, and check information in the Chunks field:
- When decreasing the replication parameters, no chunks in the overcommitted or deleting state should be present in the output.
- When increasing the replication parameters, no chunks in the blocked or urgent state should be present in the output. In addition, the overall cluster health should equal 100%.
For details, see Monitoring Replication Parameters.
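
The two conditions above can be expressed as a small check. The sketch below does not parse vstorage top output. It assumes you have already copied the per-state chunk counts and the health percentage into Python values by hand, and both the dictionary layout and the function name are hypothetical.

```python
def replication_settled(chunks, health_percent, increasing):
    """Return True when the new replication parameters look fully applied.

    chunks: dict mapping a chunk state name to the number of chunks in it.
    increasing: True if the replication parameters were raised.
    """
    if increasing:
        pending_states = ("blocked", "urgent")
        no_pending = all(chunks.get(state, 0) == 0 for state in pending_states)
        return no_pending and health_percent == 100
    pending_states = ("overcommitted", "deleting")
    return all(chunks.get(state, 0) == 0 for state in pending_states)


print(replication_settled({"urgent": 12}, 97, increasing=True))      # False
print(replication_settled({"urgent": 0}, 100, increasing=True))      # True
print(replication_settled({"deleting": 3}, 100, increasing=False))   # False
```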
How do I shut down a cluster?

To shut down a Storage cluster:
1. Stop and disable all the required cluster services.
2. Stop all running containers and virtual machines.
3. Stop all clients.
4. Stop all chunk servers.
5. Stop all MDS servers.
For details, see Shutting Down and Starting Up Cluster Nodes.
What tool do I use to monitor the status and health of a cluster?

You can monitor the status and health of your cluster using the vstorage top command. For details, see Monitoring Storage Clusters.

To view the total amount of disk space occupied by all user data in the cluster, run the vstorage top command, press the V key on your keyboard, and look for the FS field in the output. The FS field shows how much disk space is used by all user data in the cluster and in how many files these data are stored. For details, see Understanding Disk Space Usage.
How do I configure a Virtuozzo server for a cluster?

To prepare a server with Virtuozzo for work in a cluster, you simply tell the server to store its containers and virtual machines in the cluster rather than on its local disk. For details, see Stage 3: Configuring Virtual Machines and Containers.
Why do vmstat/top and Storage stat show different IO times?

The vstorage and vmstat/top utilities use different methods to compute the percentage of CPU time spent waiting for disk IO (wa% in top, wa in vmstat, and IOWAIT in vstorage). The vmstat and top utilities mark an idle CPU as waiting only if an outstanding IO request is started on that CPU, while the vstorage utility marks all idle CPUs as waiting, regardless of the number of IO requests waiting for IO. As a result, vstorage can report much higher IO values. For example, on a system with 4 CPUs and one thread doing IO, vstorage will report over 90% IOWAIT time, while vmstat and top will show no more than 25% IO time.
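
The difference between the two accounting methods can be illustrated with a toy model. This is not how either utility is implemented. It assumes a single snapshot in which each CPU is either busy or idle, and it assumes the vstorage-style method counts idle CPUs as waiting only while at least one IO request is outstanding somewhere.

```python
def iowait_percent(cpus, count_all_idle):
    """cpus: list of (state, has_outstanding_io) tuples, state is 'busy' or 'idle'.

    count_all_idle=False mimics the vmstat/top rule described above,
    count_all_idle=True mimics the vstorage rule.
    """
    any_io = any(has_io for _, has_io in cpus)
    waiting = 0
    for state, has_io in cpus:
        if state != "idle":
            continue
        if has_io or (count_all_idle and any_io):
            waiting += 1
    return 100 * waiting / len(cpus)


# 4 CPUs, one thread doing IO that was started on the first CPU.
cpus = [("idle", True), ("idle", False), ("idle", False), ("idle", False)]
print(iowait_percent(cpus, count_all_idle=False))  # 25.0
print(iowait_percent(cpus, count_all_idle=True))   # 100.0
```

The model yields the upper bounds of both methods. On a real system the CPUs are never perfectly idle, which is why the figures quoted above are "no more than 25%" and "over 90%".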
What effect does tier numbering have on Storage operation?

When assigning storage to tiers, keep in mind that faster storage drives should be assigned to higher tiers. For example, you can use tier 0 for backups and other cold data (CS without SSD journals), tier 1 for virtual environments, which hold a lot of cold data but need fast random writes (CS with SSD journals), and tier 2 for hot data (CS on SSD), journals, caches, specific virtual machine disks, and such.

This recommendation is related to how Storage works with storage space. If a storage tier runs out of free space, Storage will attempt to temporarily use a lower tier. If you add more storage to the original tier later, the data, temporarily stored elsewhere, will be moved to the tier where it should have been stored originally.

For example, if you try to write data to tier 2 and it is full, Storage will attempt to write that data to tier 1, then to tier 0. If you add more storage to tier 2 later, the aforementioned data, now stored on tier 1 or 0, will be moved back to tier 2 where it was meant to be stored originally.
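
The fallback order described above can be sketched as a simple search from the requested tier downwards. This is an illustration of the rule only: the function name and the free-space dictionary are hypothetical, and the real allocator works on chunks rather than on a single size value.

```python
def pick_tier(requested, free_space, size):
    """Return the highest tier <= requested with room for `size`, or None."""
    for tier in range(requested, -1, -1):
        if free_space.get(tier, 0) >= size:
            return tier
    return None


free_space = {0: 500, 1: 200, 2: 0}   # tier 2 is full
print(pick_tier(2, free_space, 100))  # 1

free_space[1] = 50                    # tier 1 is now too small as well
print(pick_tier(2, free_space, 100))  # 0
```

Because the search only moves towards lower tiers, data meant for tier 2 never lands on a tier above it, which is why faster drives belong on higher tiers.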

8.2.5. Shaman Service

What is the role of the shaman master?

The master instance of the shaman service watches over node availability. When it detects that a node is not available, it revokes the node’s access to the storage cluster and relocates resources from the unavailable node to other nodes in the cluster.
How is the shaman master elected? What ports are used for communication between cluster members to achieve this? What is the algorithm used for election and also for shaman workers to determine that the master is dead?

Shaman uses the Storage file system for:
1. Master election
2. Node availability detection
3. Communication between the master and worker instances
It does not use networking directly. All communications go via the Storage file system. Storage itself uses the Paxos algorithm to reach a consensus between its members. By default, MDS servers use TCP ports 2510 and 2511 to exchange messages.

Following is a detailed description of how shaman utilizes the Storage file system for the above purposes.

A file on the Storage file system can be opened either in the exclusive (read-write) mode or the shared (read-only) mode. When a process opens a file, the master MDS instance grants the process a temporary lease for that file (exclusive or shared, respectively). The lease must be periodically refreshed by the process. The default timeout for a lease is 60 seconds. If the process that has obtained the lease fails to refresh it within 60 seconds, the master MDS instance revokes the lease from the process.

An exclusive lease granted for a file prevents other processes from obtaining a lease for that file. A process trying to open a file for which an exclusive lease has been granted receives an error, even if the lease is granted to a process on another node.
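
The lease rules above can be modelled in a few lines. The sketch below is an in-memory illustration, not the MDS implementation: the class and method names are hypothetical, time is passed in explicitly so the example is deterministic, and refusing an exclusive request while shared leases are held is an assumption that the text does not state.

```python
class LeaseTable:
    """Toy model of leases granted by the master MDS instance."""

    def __init__(self, timeout=60.0):
        self.timeout = timeout
        self._leases = {}  # path -> list of (owner, mode, last_refresh)

    def _live(self, path, now):
        # Drop leases that were not refreshed within the timeout.
        live = [l for l in self._leases.get(path, []) if now - l[2] < self.timeout]
        self._leases[path] = live
        return live

    def open(self, path, owner, mode, now):
        others = [l for l in self._live(path, now) if l[0] != owner]
        if any(l[1] == "exclusive" for l in others):
            raise PermissionError(f"{path}: exclusive lease is held")
        if mode == "exclusive" and others:
            # Assumption: shared holders also block an exclusive request.
            raise PermissionError(f"{path}: shared leases are held")
        self._leases[path] = others + [(owner, mode, now)]

    def refresh(self, path, owner, now):
        live = self._live(path, now)
        for i, (holder, mode, _) in enumerate(live):
            if holder == owner:
                live[i] = (holder, mode, now)
                return True
        return False  # the lease has already been revoked


table = LeaseTable()
table.open(".shaman/master", "node-a", "exclusive", now=0)
try:
    table.open(".shaman/master", "node-b", "exclusive", now=30)
except PermissionError as err:
    print("node-b refused:", err)

# node-a never refreshed, so after the timeout node-b can take the lease.
table.open(".shaman/master", "node-b", "exclusive", now=61)
print(table.refresh(".shaman/master", "node-a", now=62))  # False
```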

Each shaman instance interacts with at least two files:
1. .shaman/master. Whoever takes an exclusive lease on this file becomes the master. Other instances periodically try to open this file. If a master instance becomes unavailable, then, after the lease timeout elapses, some other shaman instance may finally obtain the lease on this file and become a new master.
2. .shaman/md.$host_id/lock. On start, a shaman instance opens this file and refreshes its lease until the service is stopped. The master instance periodically tries opening the .shaman/md.*/lock files. If it succeeds, that means the shaman instance on the corresponding $host_id failed to refresh the lease for 60 seconds. The master instance treats this as a node crash event. It revokes the node’s access to the Storage file system with the vstorage revoke command. Then it schedules the relocation of resources registered on the crashed node to other nodes in the cluster by moving files from the .shaman/md.$host_id/resources directory to .shaman/md.$host_id/pool directories on other cluster nodes, according to the scheduler’s decision. Each instance of the shaman service periodically checks the contents of its .shaman/md.$host_id/pool directory. If it finds new files there, it runs appropriate scripts from the /usr/share/shaman directory to relocate the resource to the node. After the relocation is done, the shaman instance moves the file from its pool directory into its resources directory on the Storage file system.
The default lease timeout can be changed with the shaman set-config LOCK_TIMEOUT=seconds command, as described in man shaman.conf.
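
The file movement that follows a node crash can be simulated on a local temporary directory. This sketch only mirrors the directory layout named above. The function name, the resource file names, and the round-robin stand-in for the scheduler are hypothetical, and the real service also revokes the node's access and runs relocation scripts, which is not shown.

```python
import shutil
import tempfile
from itertools import cycle
from pathlib import Path


def relocate_resources(shaman_dir, crashed, choose_target):
    """Move each resource file of the crashed node into the pool directory
    of the node picked by `choose_target` (a stand-in for the scheduler)."""
    moved = {}
    resources = shaman_dir / f"md.{crashed}" / "resources"
    for entry in sorted(resources.iterdir()):
        target = choose_target(entry.name)
        pool = shaman_dir / f"md.{target}" / "pool"
        pool.mkdir(parents=True, exist_ok=True)
        shutil.move(str(entry), str(pool / entry.name))
        moved[entry.name] = target
    return moved


with tempfile.TemporaryDirectory() as tmp:
    shaman_dir = Path(tmp)
    resources = shaman_dir / "md.node1" / "resources"
    resources.mkdir(parents=True)
    for name in ("ct-101", "vm-202"):
        (resources / name).touch()

    healthy = cycle(["node2", "node3"])  # hypothetical scheduler decision
    print(relocate_resources(shaman_dir, "node1", lambda _name: next(healthy)))
    # {'ct-101': 'node2', 'vm-202': 'node3'}
```

Each healthy node would then find the new file in its own pool directory on its next periodic check and move it to its resources directory once the relocation is done.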
What is the purpose of the shaman-monitor watchdog?

The watchdog checks that its node has access to the Storage file system. If the Storage file system is not available (e.g., due to network connectivity issues) for WATCHDOG_TIMEOUT seconds, the WATCHDOG_ACTION is executed by the kernel.

The default action is netfilter. It blocks incoming and outgoing network packets on a node except for those required by SSH and the network file system.

This helps in situations when the following happens to a node:
- The internal network that handles Storage packets fails.
- The external network (e.g., one with public IP addresses) continues to work.
- The node has not been rebooted yet.
- Shaman has already relocated Storage resources from the node and resumed them on a healthy one.
In such cases, the failed node will continue to send outdated resource reports via its working external network. To avoid this, if the node’s internal network fails, shaman blocks all of the node’s networks by means of firewall rules.

If the netfilter firewall rules cannot be loaded on shaman service start for some reason, the action falls back to reboot. The main and the fallback actions are configured with the command shaman set-config WATCHDOG_ACTION=action[,action...], as described in man shaman.conf.

The reboot, crash, and halt actions are performed by writing the corresponding value into the /proc/sysrq-trigger special file. The netfilter action is configured by loading firewall rules from the /usr/share/shaman/init_wdog file, as described in man shaman.conf and man shaman-scripts.

So why is the watchdog needed at all?

When a node cannot access the Storage file system, the following outcomes are possible:
1. The node is isolated from the Storage network. The master shaman instance has revoked the node’s access to Storage and relocated its resources to another node, even though the node itself is not aware of this yet. At the same time:
  1. The node may also be cut off from the public network that provides outside access to containers and VMs. The watchdog is not needed in this case.
  2. The node may still be connected to the public network. In this case, a container or VM instance relocated to and started on a healthy cluster node may get the same MAC or IP address as the original instance that may still be running on the failed node even though Storage is unavailable. The watchdog is needed in this case in order to prohibit access to containers and VMs that may be still running on the failed node.
2. The Storage file system is completely unavailable on all nodes. It can happen when the Storage network switch is down, or the majority of Storage MDS instances become unavailable at the same time. Watchdog is needed in this case to prohibit access from the outside to containers and VMs that cannot access the backing storage. If Storage connectivity returns on all nodes at once, the watchdog unfences the nodes and no relocations happen. The reason is that the shaman master is unavailable when the Storage file system is unavailable in the cluster.
So, by default, the watchdog simply fences nodes by enabling the corresponding firewall rules. Nodes often reboot because the LEASE_LOST_ACTION and CLUSTER_MOUNTPOINT_DEAD_ACTION parameters are set to reboot by default. They can be set to none, but then you would need to manually clean up each node from which the master has relocated containers and VMs.
What does netfilter do? Why reboot hardware nodes instead of restarting shaman-monitor?

When the netfilter action is chosen, the shaman service installs firewall rules with a custom matcher module on its start. The firewall rules are permissive, unless the watchdog is triggered by a timeout. After that, these rules become prohibitive. If the watchdog starts working again (e.g., if the Storage file system becomes accessible), the rules become permissive again.
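
The permissive and prohibitive behaviour described above amounts to a small state machine. The sketch below models only that logic. The class name is hypothetical, the timeout of 10 seconds is an arbitrary example value rather than the product default, and the real rules are enforced by a kernel matcher module, not by Python code.

```python
class Watchdog:
    """Toy model of the netfilter watchdog state."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_ok = 0.0
        self.fenced = False  # False: permissive rules, True: prohibitive rules

    def tick(self, now, storage_accessible):
        """Record one availability check and return the fencing state."""
        if storage_accessible:
            self.last_ok = now
            self.fenced = False
        elif now - self.last_ok >= self.timeout:
            self.fenced = True
        return self.fenced


wd = Watchdog(timeout=10)      # example value only
print(wd.tick(5, True))        # False: storage reachable
print(wd.tick(12, False))      # False: only 7 s without storage
print(wd.tick(15, False))      # True: timeout elapsed, node is fenced
print(wd.tick(20, True))       # False: storage is back, rules are permissive
```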

A reboot is needed to clean up a node after the master shaman instance relocates containers or VMs from it. In this case, remnant ploop devices are left on the node. These ploop devices are created on top of Storage file system entries that are no longer available once the master revokes the node’s access to Storage. Mounting Storage once again on such a node will not help. You can manually kill all the remnant container processes, detach ploop devices, kill hung QEMU instances (in the case of VMs), and so on. Yet some leftovers of container or VM instances that are already running on another cluster node may still remain. A node reboot, however, always results in a clean environment.
How does shaman notify the cluster about a node’s graceful shutdown or reboot?

Currently, the shaman service does not notify the cluster when a graceful reboot is performed on a node. That is why all resources are left on the node when it is rebooted normally. But if the node’s shaman service is killed with SIGKILL just before the node reboots, the master shaman instance will treat it as a node crash event and relocate resources from this node.
Does DRS use SNMP to track resource consumption on each hardware node?

Yes, it does. An rmond plugin for snmpd is installed along with the pdrs service on each cluster node. The plugin exposes container and VM resource consumption metrics via the SNMP interface. The snmpd service provides these metrics along with other standard counters.

The pdrs service is started on each cluster node. One of the pdrs instances becomes the master in the same way shaman does. The pdrs master is not coupled with the shaman master, however. These services may be running on different nodes.

Each pdrs instance registers itself in the cluster by creating an entry in the .drs directory on the Storage file system. When the master instance discovers a new entry in this directory, it sends an SNMP TRAP subscription request to the snmpd instance running on that node. The snmpd instance periodically sends SNMP TRAP messages on the UDP port 33333 (set in .drs/config) to the pdrs master’s node. The master instance receives incoming messages on this port and this way collects statistics from all nodes in the cluster. The pdrs master instance also aggregates and stores the statistics in .drs/memory.node_list and .drs/episode* files on the Storage file system.

When a node crashes, the master shaman instance gets a list of resources located on the crashed node from the .shaman/md.$host_id/resources directory. It then invokes the scheduler defined in the RESOURCE_RELOCATION_MODE parameter of the global configuration. By default, this parameter has the drs value in the first item of the scheduler list. Shaman then invokes the /usr/share/shaman/pdrs_schedule script that connects to the pdrs instance running on the node. The pdrs instance reads the contents of .drs/memory.node_list and .drs/episode* files on the Storage file system and makes a scheduling decision based on this data. Note that the pdrs instance does not have to be the master to do this. After that, the master shaman instance receives the scheduling decision from the script called earlier and relocates the resources from the crashed node according to the decision.
Does DRS get the list of hardware nodes to query via SNMP from shaman?

The pdrs service requires only one file from shaman, /etc/shaman/shaman.conf, that contains the storage cluster name. Shaman, in its turn, only needs to run the /usr/share/shaman/pdrs_schedule script if RESOURCE_RELOCATION_MODE contains drs.
Attempting to run scripts mentioned in Managing Cluster Resources with Scripts produces an error. Are these scripts the basis for correctly evacuating nodes for maintenance purposes? If not, how should that be done, given that graceful shutdown does not automatically relocate containers to a healthy node?

Most shaman scripts are executed by the master shaman instance when it handles a node crash. For example, you can add a script that will send an alert that a node has crashed via an external web service. These scripts will not help with evacuating a node when preparing it for maintenance. For now, the best way to do it is to manually live-migrate all resources to other cluster nodes and only after that reboot the node.

Version 7.5 — Jan 22, 2025
