Performance-related system alerts
Disk is out of space
It is strongly advised to monitor the following out-of-space disk alerts closely:
- Cluster is out of space
- Disk may run out of space
- Disk is out of space
- Metadata disk is out of space
Disks at near-full capacity typically show decreased performance, especially during write operations. Filling disks further might result in other access issues, such as the inability to write more data. For these reasons, we recommend that used disk space never exceed 80 percent.
Note that a disk can cross this threshold regardless of how full its assigned storage tier is. In other words, a disk may have more than 80 percent of its space used even if the storage tier it is assigned to is less than 80 percent full.
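The difference between per-disk and per-tier usage can be illustrated with a short calculation. This is only a sketch: the disk names and sizes are hypothetical, and the 80 percent value is the recommendation stated above.

```python
THRESHOLD_PERCENT = 80.0  # recommended maximum from the text above


def used_percent(used, total):
    """Return used space as a percentage of total capacity."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * used / total


# Hypothetical disks of one storage tier: name -> (used GiB, total GiB).
disks = {
    "disk-a": (900, 1000),
    "disk-b": (300, 1000),
    "disk-c": (450, 1000),
}

# Usage of the tier as a whole.
tier_used = sum(used for used, _ in disks.values())
tier_total = sum(total for _, total in disks.values())
tier_percent = used_percent(tier_used, tier_total)

# Disks that individually exceed the threshold.
overfull = [
    name
    for name, (used, total) in disks.items()
    if used_percent(used, total) > THRESHOLD_PERCENT
]

print(f"tier usage: {tier_percent:.1f}%, overfull disks: {overfull}")
```

In this made-up example the tier as a whole stays well below the threshold while one disk exceeds it, which is why the per-disk alerts need to be watched separately.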
RAID resyncing
The Software RAID is not fully synced alert is raised while a software RAID device managed by the md service is resyncing. If the MDS disk is affected, cluster performance is reduced until the RAID rebuild finishes.
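On Linux, the md driver reports resync progress in /proc/mdstat. The following sketch parses text in that format and returns the progress of each device that is currently resyncing or recovering. It assumes the standard /proc/mdstat layout; the sample text is illustrative, and the alert itself may be derived from a different source.

```python
import re

# A device header line looks like "md0 : active raid1 sdb1[1] sda1[0]".
DEVICE_RE = re.compile(r"^(md\S*)\s*:")
# A progress line contains, for example, "resync =  5.9%".
PROGRESS_RE = re.compile(r"\b(resync|recovery)\s*=\s*([\d.]+)%")


def resync_progress(mdstat_text):
    """Map each resyncing md device to its reported progress in percent."""
    progress = {}
    current = None
    for line in mdstat_text.splitlines():
        device = DEVICE_RE.match(line)
        if device:
            current = device.group(1)
            continue
        match = PROGRESS_RE.search(line)
        if match and current is not None:
            progress[current] = float(match.group(2))
    return progress


# Illustrative sample in the /proc/mdstat format (not taken from a real node).
SAMPLE = """\
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      1046528 blocks super 1.2 [2/2] [UU]
      [=>...................]  resync =  5.9% (61504/1046528) finish=0.2min speed=61504K/sec

md1 : active raid1 sdb2[1] sda2[0]
      2095104 blocks super 1.2 [2/2] [UU]

unused devices: <none>
"""

print(resync_progress(SAMPLE))
```

On a real node, the same function could be applied to the contents of /proc/mdstat, for example `resync_progress(open("/proc/mdstat").read())`.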
Too many chunks or files in the cluster
When the cluster has too many files or chunks, the following alerts are raised:
- Cluster has too many chunks
- Cluster has a critically high number of chunks
- Cluster has too many files
- Cluster has a critically high number of files
These alerts indicate that the cluster is approaching its resource limit, which affects the performance of metadata operations. Once the limit is reached, the performance degradation may become a problem. In this case, you can either increase the chunk size (which may itself affect performance) or redirect some workloads to a different cluster.
The general recommendation is not to exceed four million files and ten million chunks.
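To see why a larger chunk size lowers the chunk count, consider the following rough estimate. It is a sketch under a simplifying assumption: each file is split into ceil(size / chunk size) chunks and redundancy (replicas or erasure coding) is ignored. The file sizes and chunk sizes are hypothetical; only the two limits come from the recommendation above.

```python
FILE_LIMIT = 4_000_000    # recommended maximum number of files (from the text)
CHUNK_LIMIT = 10_000_000  # recommended maximum number of chunks (from the text)


def chunk_count(file_sizes, chunk_size):
    """Estimate the chunk count as ceil(size / chunk_size) per non-empty file.

    Simplifying assumption: redundancy is ignored.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # -(-a // b) is integer ceiling division.
    return sum(-(-size // chunk_size) for size in file_sizes if size > 0)


def within_limits(files, chunks):
    """Check both counts against the recommended limits."""
    return files <= FILE_LIMIT and chunks <= CHUNK_LIMIT


# Hypothetical file sizes and chunk sizes, all in MiB.
file_sizes_mib = [1000, 250, 10]
small_chunks = chunk_count(file_sizes_mib, 64)
large_chunks = chunk_count(file_sizes_mib, 256)

print(small_chunks, large_chunks)
```

The benefit comes mainly from large files; a file smaller than one chunk still occupies one chunk regardless of the chunk size.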
CPU overload
When the metadata (MDS) or S3 gateway service uses CPU resources above a threshold, the system issues the following alerts:
- Metadata service has high CPU usage
- S3 Gateway service has high CPU usage
- S3 Gateway service has critically high CPU usage
The CPU usage of the metadata service can be checked either in Grafana or from the command line.
In Grafana, you can check it on the Virtuozzo Storage MDS details dashboard.
To check the MDS CPU usage via the command line, use the following command:
# vstorage -c <CLUSTER> top
Press m to focus on the metadata service. In the command output, the current CPU usage is reported in the %CPU column, as shown in the following example:
  MDSID STATUS %CTIME COMMITS %CPU  MEM UPTIME HOST
      3 avail    0.0%     0/s 0.2% 142m  6d 1h management…
M     1 avail    0.1%     1/s 0.2% 149m  6d 1h management…
      2 avail    0.1%     1/s 0.2% 149m  6d 1h management…
The alert is raised when the CPU usage is above 80 percent for at least five minutes. For the metadata service, it is normal to use 100 percent of the CPU during peak traffic. However, this may also indicate an issue if the alert persists longer than one day and it is accompanied by other performance issues (for example, high latency).
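The "above 80 percent for at least five minutes" condition can be sketched as a check over timestamped samples. This is not how the product evaluates the alert, which the text does not describe; the function name, the sample format, and the one-minute sampling interval are assumptions made for illustration.

```python
def sustained_above(samples, threshold=80.0, min_duration=300.0):
    """Return True if values stay above threshold for min_duration seconds.

    samples: (timestamp in seconds, value) pairs sorted by timestamp.
    A single sample at or below the threshold resets the measured stretch.
    """
    start = None
    for timestamp, value in samples:
        if value > threshold:
            if start is None:
                start = timestamp
            if timestamp - start >= min_duration:
                return True
        else:
            start = None
    return False


# Hypothetical CPU usage readings taken once per minute.
steady_load = [(t, 95.0) for t in range(0, 301, 60)]
brief_dip = [(0, 95.0), (60, 96.0), (120, 97.0), (180, 40.0), (240, 95.0), (300, 95.0)]

print(sustained_above(steady_load), sustained_above(brief_dip))
```

A short dip below the threshold restarts the five-minute window, so brief spikes alone do not satisfy the condition.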
Similarly, the CPU usage of the Object storage service can be checked on the Object Storage overview dashboard in Grafana.
High latency
Several services can be affected by high latency. The system issues alerts when the latency of any of the following services is too high:
- Metadata service (MDS)
- Chunk service (CS)
- Object storage (S3) services
Metadata service latency
The metadata service latency represents the response time of all metadata operations. You can check the metadata service latency on the Virtuozzo Storage MDS details dashboard in Grafana.
The Metadata service has high commit latency alert triggers when the 95th percentile latency of the service is higher than one second for more than five minutes.
Though it is considered normal for latency to increase during peak hours, it may also indicate an issue if the alert persists for more than one day and performance is below expectations.
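To make the 95th percentile criterion concrete, the sketch below computes a percentile from raw latency samples with the nearest-rank method. The latency values are hypothetical, and a monitoring system may compute percentiles differently (for example, from histogram buckets), so treat this as an illustration of the concept only.

```python
def percentile(values, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    if not values:
        raise ValueError("values must not be empty")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100 avoids floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


LIMIT_SECONDS = 1.0  # threshold from the text above

# Hypothetical commit latencies in seconds, 20 samples each.
one_outlier = [0.05] * 19 + [2.0]
two_outliers = [0.05] * 18 + [2.0, 2.0]

print(percentile(one_outlier, 95) > LIMIT_SECONDS)
print(percentile(two_outliers, 95) > LIMIT_SECONDS)
```

With 20 samples, a single slow operation does not move the 95th percentile, whereas two slow operations do. This is why the metric reflects sustained slowness rather than isolated outliers.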
Chunk server latency
The chunk service latency can be checked either in Grafana or from the command line.
In Grafana, you can check it on the Virtuozzo Storage core cluster overview dashboard.
You can also monitor the latency of a particular disk on the Hardware node details dashboard.
Alternatively, the latency of a storage disk is shown on the Virtuozzo Storage CS details dashboard.
To check the chunk server latency via the command line, use the following command:
# vstorage -c <CLUSTER> top
Press i to view the optime values. The command output will be similar to this:
CSID IOWAIT SWAIT OPTIME(ms) IOLAT(ms) SLAT(ms) QDEPTH    RMW   JRMW
1027     0%    0%        N/A       0/0      0/0    0.0 0ops/s 0ops/s
1029     0%    0%        N/A       0/0      0/0    0.0 0ops/s 0ops/s
1025     0%    0%        N/A       0/0      0/0    0.0 0ops/s 0ops/s
... (rows 1-3 of 6)
The optime represents the average time spent serving each I/O request, excluding the time the request spent in the I/O queue. If the optime is consistently high (for example, above 100 ms), there might be an issue with the I/O path. If the issue is caused by disk wear, it can be fixed by replacing the disk. Other solutions include upgrading the device firmware, replacing faulty cables, replacing a faulty controller, or adding capacity.
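One way to make "consistently high" concrete is to require that the most recent readings all exceed the limit. The sketch below takes that approach. The window of five readings, the sample values, and the use of None for N/A readings are assumptions for illustration; only the 100 ms figure comes from the text.

```python
def consistently_high(samples_ms, limit_ms=100.0, window=5):
    """Return True if the last `window` numeric readings all exceed limit_ms.

    samples_ms: optime readings in milliseconds, oldest first.
    None stands for an N/A reading and is skipped.
    """
    recent = [s for s in samples_ms if s is not None][-window:]
    return len(recent) == window and all(s > limit_ms for s in recent)


# Hypothetical optime readings per chunk server ID.
readings = {
    1027: [None, None, None],
    1029: [120.0, 130.0, 150.0, 110.0, 140.0],
    1025: [20.0, 150.0, 30.0, 25.0, 22.0],
}

suspect = [cs_id for cs_id, samples in readings.items() if consistently_high(samples)]
print(suspect)
```

A single spike, as in the readings for the third chunk server, is not flagged; only a run of high values is treated as a possible I/O path issue.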
Object storage service latency
The system raises alerts when one of the following latencies exceeds a certain threshold for more than five minutes:
- S3 gateway GET latency
- Object service commit latency
- Object service request latency
- Name service commit latency
- Name service request latency
Though it is considered normal for latency to increase during peak hours, it may also indicate an issue if the alert persists for more than one day and performance is below expectations.
You can check the latency for the object storage services in Grafana. To monitor the S3 gateway GET and PUT latency, use the S3 overview dashboard.
To check the object service latency, go to the Object Storage OS details dashboard.
The name service latency can be checked on the Object Storage NS details dashboard.
S.M.A.R.T. alerts
S.M.A.R.T. alerts must be treated with high priority. Any sign of disk wear can decrease cluster performance, and a device failure degrades it further: it may cause the loss of redundancy and availability, as well as node downtime if the system disk is affected. Moreover, recovery operations, such as storage rebalancing, may also affect system performance.