Calculating disk health

Node disk monitoring is handled by the vstorage-disks-monitor service. This service runs on the primary management node and queries disk, chunk server (CS), and S.M.A.R.T. metrics from the Prometheus service for further analysis.

The service calculates disk health as a percentage, giving each of these metrics its own weight. You can configure the weights in the /etc/disks-monitor/analyzers.yml configuration file. If the calculated health score is below zero, vstorage-disks-monitor marks all CSes on that disk as ill (unresponsive) and fences them from cluster I/O to prevent performance degradation.
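
To illustrate how weighted metrics can push a health score below zero, here is a minimal sketch. It is not the service's actual formula, which is not documented here. It assumes that each metric reports a severity between 0.0 and 1.0 and that health is 100 minus the weighted sum of severities. The metric names, the weight values, and the health_score and should_fence helpers are all hypothetical.

```python
from typing import Dict

# Hypothetical weights in percentage points. The real values are set in
# /etc/disks-monitor/analyzers.yml, whose format is not shown in this section.
EXAMPLE_WEIGHTS: Dict[str, float] = {
    "slow_disk": 40.0,
    "scsi_failures": 60.0,
    "smart": 50.0,
}


def health_score(severity: Dict[str, float], weights: Dict[str, float]) -> float:
    """Return 100 minus the weighted sum of per-metric severities.

    Assumption: each severity is a value from 0.0 (healthy) to 1.0 (worst).
    Values outside that range are clamped; missing metrics count as healthy.
    """
    penalty = 0.0
    for name, weight in weights.items():
        level = min(max(severity.get(name, 0.0), 0.0), 1.0)
        penalty += weight * level
    return 100.0 - penalty


def should_fence(score: float) -> bool:
    """Mirror the documented rule: CSes are fenced when health is below zero."""
    return score < 0
```

Under these assumptions, a single bad metric lowers the score but does not trigger fencing, whereas several bad metrics together can take the score below zero.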

Disk wear is evaluated in parallel with disk health. To help you respond proactively, the system raises alerts based on a disk's remaining lifetime. When the remaining lifetime falls to 10%, the system raises a warning alert. When it drops to 5%, the system raises a critical alert. Once the fatal wearout threshold is reached, the service places all CSes on that disk into maintenance mode, which blocks all write operations to those CSes. This helps prevent the loss of data redundancy and availability in the event of disk failure.
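
The alert levels described above can be sketched as a simple classification. The 10% and 5% values come from the text. The wear_alert function name is hypothetical, and the fatal threshold is left out because its value is not given in this section.

```python
from typing import Optional

WARNING_PERCENT = 10.0   # warning alert at 10% remaining lifetime (from the text)
CRITICAL_PERCENT = 5.0   # critical alert at 5% remaining lifetime (from the text)


def wear_alert(remaining_percent: float) -> Optional[str]:
    """Return the alert level for a disk's remaining lifetime, or None.

    The more severe level is checked first so that a disk at or below 5%
    is reported as critical rather than as a warning.
    """
    if remaining_percent <= CRITICAL_PERCENT:
        return "critical"
    if remaining_percent <= WARNING_PERCENT:
        return "warning"
    return None
```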

Using the vstorage-disks-monitor tool, you can do the following:

  • View disk statuses by running vstorage-disks-monitor health
  • List disk alerts by running vstorage-disks-monitor alerts

The service's configuration options can be specified in the /etc/sysconfig/disks-monitor file. They include:

--fencing.enable
Enable fencing of ill CSes. Default value: true.
--fencing.max
Maximum number of ill CSes allowed in the cluster. Default value: 1.
--metric.wearout.max-ratio
Percentage of worn-out CS disks allowed in the cluster. Default value: 30.
--checker.failed_cs.enable
Enable the CS failure checker. Default value: true.
--checker.scsi_failures.enable
Enable the SCSI failure checker. Default value: true.
--checker.slow_cs.enable
Enable the slow CS checker. Default value: true.
--checker.slow_disk.enable
Enable the slow disk checker. Default value: true.
--checker.smart.enable
Enable the S.M.A.R.T. disk checker. Default value: true.
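
As a quick reference, the documented defaults can be collected in one place and compared against planned changes. The sketch below is illustrative only: the render_overrides helper is hypothetical, and the --name=value rendering is an assumption. Check the exact syntax expected in /etc/sysconfig/disks-monitor on your system before editing the file.

```python
from typing import Dict, List, Union

Value = Union[bool, int]

# Default values as listed in this section.
DOCUMENTED_DEFAULTS: Dict[str, Value] = {
    "fencing.enable": True,
    "fencing.max": 1,
    "metric.wearout.max-ratio": 30,
    "checker.failed_cs.enable": True,
    "checker.scsi_failures.enable": True,
    "checker.slow_cs.enable": True,
    "checker.slow_disk.enable": True,
    "checker.smart.enable": True,
}


def render_overrides(overrides: Dict[str, Value]) -> List[str]:
    """Render only the options that differ from the documented defaults.

    Assumption: options are written as --name=value, with booleans in
    lowercase. Unknown option names are rejected to catch typos early.
    """
    flags: List[str] = []
    for name, value in overrides.items():
        if name not in DOCUMENTED_DEFAULTS:
            raise KeyError(f"unknown option: {name}")
        if value == DOCUMENTED_DEFAULTS[name]:
            continue  # same as the default, nothing to override
        text = str(value).lower() if isinstance(value, bool) else str(value)
        flags.append(f"--{name}={text}")
    return flags
```

For example, raising the ill CS limit to 2 and turning off the S.M.A.R.T. checker yields two flags, while options left at their defaults yield none.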

The service logs are stored at /var/log/disks-monitor/disks-monitor.log.

Limitations

  • The vstorage-disks-monitor service is disabled in clusters deployed on virtual machines.