Monitoring the storage cluster

To view the storage cluster status

Admin panel

Click the cluster name at the bottom of the left menu. The status can be one of the following:

Healthy: All cluster components are active and operate normally.
Unavailable: Not enough information about the cluster state (for example, because the cluster is inaccessible).
Degraded: Some of the cluster components are inactive or inaccessible. The cluster is trying to heal itself; data replication is scheduled or in progress.
Error: The cluster has too many inactive services and automatic replication is disabled. If the cluster enters this state, troubleshoot the nodes or contact the support team.

Command-line interface

Use the following command:

```
vinfra cluster overview
```

For example, to view the status of the cluster cluster1, check the status row in the command output:

```
+-------------------+-------------------------+
| Field             | Value                   |
+-------------------+-------------------------+
| ...               | ...                     |
| status            | healthy                 |
| ...               | ...                     |
+-------------------+-------------------------+
```
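If the status needs to be checked from a script rather than read by eye, the value can be extracted from the tabular output. The following is a minimal sketch that parses a two-column table in the layout shown above. The `get_field` helper and the embedded sample text are illustrative assumptions and are not part of `vinfra`.

```python
from typing import Optional


def get_field(table_text: str, field: str) -> Optional[str]:
    """Return the value for `field` from a two-column ASCII table, or None."""
    for line in table_text.splitlines():
        line = line.strip()
        # Data rows look like "| status | healthy |"; border rows start with "+".
        if not line.startswith("|"):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if len(cells) == 2 and cells[0] == field:
            return cells[1]
    return None


# Sample text copied from the example output above.
sample = """\
+-------------------+-------------------------+
| Field             | Value                   |
+-------------------+-------------------------+
| ...               | ...                     |
| status            | healthy                 |
| ...               | ...                     |
+-------------------+-------------------------+
"""

print(get_field(sample, "status"))
```

In practice, the table text would come from running the command itself, for example with `subprocess.run(["vinfra", "cluster", "overview"], capture_output=True, text=True).stdout`.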

To view the storage cluster statistics

Admin panel

Go to the Monitoring > Dashboard screen.

- To view the storage cluster statistics in full screen, click Fullscreen mode.
- To exit fullscreen mode, press Esc or click Exit fullscreen mode.

The default time interval for the charts is twelve hours. To zoom into a particular time interval, select the interval with the mouse; to reset zoom, double-click any chart.

Command-line interface

Use the following command:

```
vstorage -c <cluster_name> top
```

For example, to view the general information about the cluster cluster1, take a look at this section from the command output:

```
Cluster 'cluster1': healthy
Space: [OK] allocatable 11.9TB of 57.3TB, free 13.0TB of 57.3TB
MDS nodes: 3 of 3, epoch uptime: 13d  3h
CS nodes:  32 of 32 (32 avail, 0 inactive, 0 offline)
License: ACTIVE (expiration: 01/01/2100, capacity: 500TB, used: 21.2TB)
Replication:  1 norm,  1 limit
IO:       read 26.2MB/s (1.9Kop/s), write  426MB/s (11Kops/s)
```
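For automated checks, the summary lines above can be read with a few regular expressions. The sketch below is one possible approach. It assumes the line layout matches the sample exactly, and the `parse_summary` function and its dictionary keys are hypothetical names chosen for illustration.

```python
import re

# Lines copied from the sample output above.
SAMPLE = """\
Cluster 'cluster1': healthy
Space: [OK] allocatable 11.9TB of 57.3TB, free 13.0TB of 57.3TB
MDS nodes: 3 of 3, epoch uptime: 13d  3h
CS nodes:  32 of 32 (32 avail, 0 inactive, 0 offline)
"""


def parse_summary(text):
    """Extract the cluster name, status, and node counts from the summary lines."""
    result = {}
    m = re.search(r"^Cluster '([^']+)':\s*(.+)$", text, re.MULTILINE)
    if m:
        result["cluster"] = m.group(1)
        result["status"] = m.group(2).strip()
    m = re.search(r"^MDS nodes:\s*(\d+) of (\d+)", text, re.MULTILINE)
    if m:
        result["mds_active"] = int(m.group(1))
        result["mds_total"] = int(m.group(2))
    m = re.search(
        r"^CS nodes:\s*(\d+) of (\d+) \((\d+) avail, (\d+) inactive, (\d+) offline\)",
        text,
        re.MULTILINE,
    )
    if m:
        keys = ("cs_active", "cs_total", "cs_avail", "cs_inactive", "cs_offline")
        result.update(zip(keys, map(int, m.groups())))
    return result


summary = parse_summary(SAMPLE)
print(summary["cluster"], summary["status"], summary["cs_offline"])
```

A monitoring script could, for instance, raise an alert when `status` is not `healthy` or when `cs_offline` is greater than zero.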

Cluster

Overall status of the cluster:

healthy: All chunk servers in the cluster are active.
unknown: There is not enough information about the cluster state (for example, because the master MDS server was elected a while ago).
degraded: Some of the chunk servers in the cluster are inactive.
failure: The cluster has too many inactive chunk servers; the automatic replication is disabled.
SMART warning: One or more physical disks attached to cluster nodes are in pre-failure condition.

Space

Amount of disk space in the cluster:

free: Free physical disk space in the cluster.
allocatable: Amount of logical disk space available for storing data. Allocatable disk space is calculated on the basis of the current replication parameters and free disk space on chunk servers. It may also be limited by the license.
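To track these values over time, it can be convenient to turn the Space line into numbers. The following sketch parses a line in the format of the sample output and reports each value as a percentage of the total. The `parse_space` function is a hypothetical helper, and the percentage is computed only when both values are printed with the same unit, so no assumption about unit sizes is needed.

```python
import re

SPACE_PATTERN = re.compile(
    r"(allocatable|free)\s+([\d.]+)([KMGTP]?B) of ([\d.]+)([KMGTP]?B)"
)


def parse_space(line):
    """Return amount, total, unit, and percentage for 'allocatable' and 'free'."""
    result = {}
    for name, amount, unit, total, total_unit in SPACE_PATTERN.findall(line):
        if unit != total_unit or float(total) == 0:
            continue  # skip entries where a plain ratio would be misleading
        result[name] = {
            "amount": float(amount),
            "total": float(total),
            "unit": unit,
            "percent": round(100 * float(amount) / float(total), 1),
        }
    return result


space = parse_space("Space: [OK] allocatable 11.9TB of 57.3TB, free 13.0TB of 57.3TB")
print(space["allocatable"]["percent"], space["free"]["percent"])
```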

MDS nodes

Number of active MDS servers as compared to the total number of MDS servers configured for the cluster.

epoch uptime: Time elapsed since the MDS master server election.

CS nodes

Number of active chunk servers as compared to the total number of chunk servers configured for the cluster. The values in parentheses give additional information about these chunk servers:

avail: Active chunk servers that are currently up and running in the cluster.
inactive: Chunk servers that are temporarily unavailable. A chunk server is marked as inactive during its first 5 minutes of inactivity.
offline: Chunk servers that have been inactive for more than 5 minutes. Once a chunk server changes its state to offline, the cluster starts replicating data to restore the chunks that were stored on it.
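The relationship between the three states can be expressed as a simple rule based on how long a chunk server has been unresponsive. The sketch below only models the description above and does not show how the cluster tracks state internally. The `cs_state` function is a hypothetical name, and treating exactly 5 minutes as still inactive is an assumption.

```python
OFFLINE_AFTER_SECONDS = 5 * 60  # threshold taken from the description above


def cs_state(seconds_inactive: float) -> str:
    """Classify a chunk server by how long it has been unresponsive."""
    if seconds_inactive <= 0:
        return "avail"
    # Assumption: the boundary value itself still counts as inactive.
    if seconds_inactive <= OFFLINE_AFTER_SECONDS:
        return "inactive"
    return "offline"


for seconds in (0, 120, 301):
    print(seconds, cs_state(seconds))
```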

License

Key number under which the license is registered on the Key Authentication server and license state.

Replication

Replication settings: the normal number of chunk replicas and the limit after which a chunk gets blocked until recovered.

IO

Disk I/O activity in the cluster:

Speed of read and write I/O operations, in bytes per second
Number of read and write I/O operations per second
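Both values can also be read programmatically from the IO line. The sketch below keeps the throughput exactly as printed and converts only the operation counts to plain numbers. The `parse_io` function is a hypothetical helper, and interpreting the `K`, `M`, and `G` prefixes on operation counts as decimal multipliers is an assumption.

```python
import re

# Assumption: prefixes on operation counts are decimal multipliers.
OPS_MULTIPLIER = {"": 1, "K": 1_000, "M": 1_000_000, "G": 1_000_000_000}

IO_PATTERN = re.compile(
    r"(read|write)\s+([\d.]+)([KMGT]?B)/s\s+\(\s*([\d.]+)([KMG]?)ops?/s\)"
)


def parse_io(line):
    """Return throughput (as printed) and operations per second per direction."""
    stats = {}
    for direction, rate, unit, ops, prefix in IO_PATTERN.findall(line):
        stats[direction] = {
            "rate": float(rate),
            "unit": unit + "/s",
            "ops_per_second": float(ops) * OPS_MULTIPLIER[prefix],
        }
    return stats


io = parse_io("IO:       read 26.2MB/s (1.9Kop/s), write  426MB/s (11Kops/s)")
print(io["read"]["rate"], io["read"]["unit"], io["write"]["ops_per_second"])
```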

To view more details about the storage cluster

Go to the Monitoring > Dashboard screen, and then click Grafana dashboard.

A separate browser tab opens with preconfigured Grafana dashboards, where you can manage existing dashboards, create new ones, share them between users, configure alerting, and so on. The dashboards use the Prometheus data source, whose metrics are stored for seven days. To increase this retention period, configure it manually. For more information, refer to the Grafana documentation.

On the Virtuozzo Storage core cluster overview dashboard, note the following charts:

The Mountpoint availability, MDS availability, and CS availability charts show the availability of the corresponding storage services. Time periods when a service was unavailable are highlighted in gray. In this case, check the logs on the nodes with the failed service and report the problem. To see the logs, use the following commands:
- For storage mountpoints:
```
# blogcat /var/log/vstorage/<cluster_name>/vstorage-mount.*.blog
```
- For MDS:
```
# zstdcat /vstorage/mds/logs/mds.log.zst
```
- For CS:
```
# zstdcat /vstorage/<id>/cs/logs/cs.log.zst
```
For advanced monitoring of the core storage services, you can view the Virtuozzo Storage mountpoint details, Virtuozzo Storage MDS details, and Virtuozzo Storage CS details dashboards, which are intended for low-level troubleshooting by the support team.
The Data health chart shows the number of healthy chunks (the ones that have all the replicas available), the number of chunks that need to be replicated to have the configured number of replicas available, and the replication rate in chunks per second.
The Latency chart shows the average latency of read and write I/O operations across all storage clients.

On the Reencoding overview dashboard, you can check the re-encoding process details, such as the re-encoding speed in the cluster and on particular nodes, the number and estimated physical size of files to be re-encoded, and the number of chunks to be re-encoded.