Performance issues and symptoms
Disk IOPS saturation
A common root cause of performance issues is a disk reaching its throughput limit. To check whether a disk has reached its limit, examine the state of its I/O queue: if the queue is constantly full, the disk is operating at its maximum capacity. To investigate the state of the I/O queue for all disks, use the following command:
# iostat -x 10
The command output will be similar to this:
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           4.08    0.00    1.11    0.03    0.03   94.76

Device:  rrqm/s  wrqm/s    r/s    w/s   rkB/s   wkB/s  ...  %util
sda        0.00    8.30   0.00   3.20    0.00   46.40  ...   0.20
sdb        0.00    0.00   0.00   0.00    0.00    0.00  ...   0.00
sdc        0.00    0.00   0.00   0.00    0.00    0.00  ...   0.00
scd0       0.00    0.00   0.00   0.00    0.00    0.00  ...   0.00
You need to pay attention to the following metrics:
%iowait
is the percentage of time the CPU was idle while requests were in the I/O queue. If this value is well above zero, this might mean that the node I/O is constrained by the disk speed.
%idle
is the percentage of time the CPU was idle while there were no requests in the I/O queue. If this value is close to zero, this means that the node I/O is constrained by the CPU speed.
%util
is the percentage of time the device I/O queue was not empty. If this value is close to 100%, this means that the disk throughput is reaching its limit; see the example below for a quick way to flag such devices.
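To flag saturated devices automatically over a longer interval, you can filter the iostat output. This one-liner is only a sketch, not part of the product tooling: it assumes common device name prefixes (sd, vd, nvme), relies on %util being the last column of the iostat -x output (the column set varies between sysstat versions), and forces the C locale so that decimal values are parsed with a dot separator:
# LC_ALL=C iostat -x 10 | awk '$1 ~ /^(sd|vd|nvme)/ && $NF+0 > 90 {print $1, "%util =", $NF}'
The command prints the name and %util of every device that exceeds 90 percent utilization in a 10-second sample.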
iSCSI LUN performance
iSCSI LUNs served via Virtuozzo Hybrid Infrastructure may experience reduced performance, especially when data is accessed by a large number of threads. In this case, it is generally preferable to split the load across multiple smaller iSCSI LUNs or, if possible, to avoid iSCSI by accessing the storage devices directly. If the workload requires a single large volume and the LUN size cannot simply be reduced, consider deploying an LVM RAID0 group over multiple smaller LUNs.
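For example, several smaller LUNs can be aggregated into one striped logical volume with LVM. The following is only a rough sketch: the device names /dev/sdd through /dev/sdg and the volume names are placeholders, and the stripe count simply matches the number of LUNs in this example.
# pvcreate /dev/sdd /dev/sde /dev/sdf /dev/sdg
# vgcreate vg_lun /dev/sdd /dev/sde /dev/sdf /dev/sdg
# lvcreate --name lv_data --extents 100%FREE --stripes 4 vg_lun
The last command creates a logical volume striped across all four LUNs (RAID0-like striping), so multi-threaded I/O is spread over all of them.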
Journal size and location
If your cluster was deployed from an old product version, it is possible that the journal configuration is not optimal. Specifically, if the journal is configured as “inner cache,” that is, stored on the same physical device as data, the recommended size is 256 MB.
Also, consider moving the journals to a faster device such as an SSD or NVMe, if it is applicable to your workload. For more details on storage cache configuration, refer to Cache configuration.
Make sure the journal settings are the same for all journals in the same tier.
To check the current size and location of journals, run the following command and specify the desired disk:
# vinfra node disk show <DISK>
For example:
# vinfra node disk show sdc
The command output will be similar to this:
| service_params | journal_data_size: 270532608                                              |
|                | journal_disk_id: 5EE99654-4D5E-4E00-8AF6-7E83244C5E6B                     |
|                | journal_path: /vstorage/94aebe68/journal/journal-cs-b8efe751-96a6ff460b80 |
|                | journal_type: inner_cache                                                 |
The journal size and location are reported in the journal_data_size and journal_path fields, and the journal type in journal_type. In this example, the journal size is 270532608 bytes (258 MiB) and the journal is stored as inner cache.
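To compare journal settings across several disks, you can extract only the journal-related fields from the command output. This is a quick sketch; the disk names are examples, so substitute the disks that belong to the same tier:
# for disk in sdb sdc sdd; do vinfra node disk show "$disk" | grep -E 'journal_(data_size|disk_id|path|type)'; done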
RAM and swap usage
Though the system may operate at nearly 100 percent RAM consumption, it should not aggressively swap virtual memory to and from a disk.
You can check the state of the swap space by running:
# free -hm
              total        used        free      shared  buff/cache   available
Mem:           7,7G        2,2G        498M        3,7G        5,0G        1,5G
Swap:          3,9G        352M        3,6G
In the command output, the Swap row shows the current swap space size. If the reported size is zero, it means swapping is disabled, which may be done intentionally in some configurations.
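If you are unsure whether swapping is enabled or how readily the kernel swaps, the following standard Linux checks can help (these are generic commands, not product-specific tooling):
# swapon --show
# cat /proc/sys/vm/swappiness
The first command lists active swap devices (empty output means swap is disabled); the second shows the vm.swappiness setting, where higher values make the kernel swap more aggressively.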
You can also check the use of swap space by running:
# vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si    so    bi    bo   in    cs us sy id wa st
 3  1 244208  10312   1552  62636    4    23    98   249   44   304 28  3 68  1  0
 0  2 244920   6852   1844  67284    0   544  5248   544  236  1655  4  6  0 90  0
 1  2 256556   7468   1892  69356    0  3404  6048  3448  290  2604  5 12  0 83  0
 0  2 263832   8416   1952  71028    0  3788  2792  3788  140  2926 12 14  0 74  0
 0  3 274492   7704   1964  73064    0  4444  2812  5840  295  4201  8 22  0 69  0
In the command output, the si and so columns show the amount of memory swapped in from disk and out to disk, respectively, in KB/s. Swap usage may be considered acceptable as long as it is well below the device maximum throughput, that is, as long as it does not interfere with the device performance.
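To get a rough number to compare against the swap device throughput, you can average the swap-out rate over a few samples. This is only a sketch; note that the first vmstat line reports averages since boot, which slightly skews such a short sample:
# vmstat 1 5 | tail -n +3 | awk '{sum += $8; n++} END {printf "average swap-out: %.0f KB/s\n", sum/n}'
Here column 8 is the so value; compare the result with the sustained write throughput of the swap device (for example, the wkB/s value reported by iostat -x for that device).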
S3 service performance
S3 load balancing
We recommend using load balancing at all times. The only scenario that does not benefit from load balancing is when there is a single client. For recommendations on setting up load balancing, refer to the Administrator Guide.
S3 gateways
By default, the S3 service runs with four S3 gateways per node. However, you can increase the number of S3 gateways to improve the overall performance if you notice the following signs:
- The CPU usage of the S3 gateway is near 100 percent.
- The latency of the S3 service is very high (for example, an average latency of more than two seconds). A quick way to check both signs is shown after this list.
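The following is a rough way to check both signs; the S3 endpoint URL and the gateway process name pattern are assumptions, so substitute the values used in your deployment:
# for i in 1 2 3 4 5; do curl -o /dev/null -s -w '%{time_total}s\n' https://s3.example.com/; done
# top -b -n 1 | grep -i s3gw
The first command times several requests against the S3 endpoint (an unauthenticated request returns an error response but still measures the service round-trip time); the second shows the CPU usage of the gateway processes.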
You can change the number of S3 gateways per node by using one of the following commands:
- To set the number of S3 gateways on all nodes in the S3 cluster:
vinfra service s3 cluster change --s3gw-count <count>
--s3gw-count <count>
- Number of S3 gateways per node
For example, to increase the number of S3 gateways per node to 5, run:
# vinfra service s3 cluster change --s3gw-count 5
- To set the number of S3 gateways on a particular S3 node:
vinfra service s3 node change --nodes <node_id> --s3gw-count <count>
--nodes <node_id>
- A comma-separated list of node hostnames or IDs
--s3gw-count <count>
- Number of S3 gateways
For example, to reduce the number of S3 gateways on the node node003 to 3, run:
# vinfra service s3 node change --nodes node003 --s3gw-count 3
When removing a gateway, keep in mind that all ongoing connections to this gateway will be dropped and all ongoing operations may result in a connection error on the client side.