Calculating disk health
Node disk monitoring is handled by the vstorage-disks-monitor service. This service runs on the primary management node and queries disk, chunk server (CS), and S.M.A.R.T. metrics from the Prometheus service for further analysis.
The service calculates disk health, in percent, based on the weight of each metric. Weights can be configured in the /etc/disks-monitor/analyzers.yml configuration file. If the calculated health score is below zero, vstorage-disks-monitor marks all chunk services (CSes) on that disk as ill (unresponsive) and fences them from cluster I/O to prevent performance degradation.
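The scoring formula itself is not described here, so the following Python sketch is only an illustration of how a weighted score can drop below zero and trigger fencing. It assumes a hypothetical model in which health starts at 100% and each triggered metric subtracts its weight. The function names, metric names, and weight values are all made up for the example and do not reflect the actual contents of analyzers.yml.

```python
from typing import Dict

# Hypothetical weights, for illustration only; real values live in
# /etc/disks-monitor/analyzers.yml and may be structured differently.
WEIGHTS: Dict[str, float] = {
    "smart": 60.0,
    "slow_disk": 30.0,
    "scsi_failures": 50.0,
}


def disk_health(weights: Dict[str, float], triggered: Dict[str, bool]) -> float:
    """Assumed model: start at 100% and subtract the weight of each triggered metric."""
    health = 100.0
    for metric, weight in weights.items():
        if triggered.get(metric, False):
            health -= weight
    return health


def should_fence(health: float) -> bool:
    """CSes on the disk are marked ill and fenced only when health is below zero."""
    return health < 0


if __name__ == "__main__":
    one_issue = disk_health(WEIGHTS, {"smart": True})
    all_issues = disk_health(WEIGHTS, {name: True for name in WEIGHTS})
    print(one_issue, should_fence(one_issue))
    print(all_issues, should_fence(all_issues))
```

Under this assumed model, a single triggered metric lowers the score but does not fence the disk; fencing happens only when the combined weights of the triggered metrics exceed 100.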
Disk wear is evaluated in parallel with disk health. To help you respond proactively, the system raises alerts based on a disk's remaining lifetime. When it falls to 10%, the system generates a warning alert. If the remaining lifetime drops to 5%, a critical alert is raised. Once the wearout threshold (fatal) is reached, the service places all CSes on that disk into maintenance mode, blocking all write operations to those CSes. This helps prevent the loss of data redundancy and availability in the event of disk failure.
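The alert levels described above can be sketched as a small classification function. The 10% and 5% thresholds come from the text. The value of the fatal wearout threshold is not given here, so it is passed in as a parameter with an assumed default of 0% remaining lifetime. The function name and the returned labels are hypothetical.

```python
from typing import Optional

WARNING_REMAINING = 10.0   # warning alert at 10% remaining lifetime
CRITICAL_REMAINING = 5.0   # critical alert at 5% remaining lifetime


def wear_alert(remaining_pct: float, fatal_remaining: float = 0.0) -> Optional[str]:
    """Map a disk's remaining lifetime (in percent) to an alert level.

    fatal_remaining is an assumption: the actual fatal wearout threshold
    is not stated in this section.
    """
    if remaining_pct <= fatal_remaining:
        return "fatal"      # CSes on the disk go into maintenance mode
    if remaining_pct <= CRITICAL_REMAINING:
        return "critical"
    if remaining_pct <= WARNING_REMAINING:
        return "warning"
    return None             # no wear alert


if __name__ == "__main__":
    for pct in (50.0, 10.0, 5.0, 0.0):
        print(pct, wear_alert(pct))
```

The checks run from the most severe level down, so a disk that has crossed several thresholds is reported at the highest applicable level only.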
Using the vstorage-disks-monitor tool, you can do the following:
- View disk statuses by running vstorage-disks-monitor health
- List disk alerts by running vstorage-disks-monitor alerts
The service's configuration options can be specified in the /etc/sysconfig/disks-monitor file. They include:
--fencing.enable
- Enable fencing of ill CSes. Default value: true.
--fencing.max
- Maximum number of ill CSes allowed in the cluster. Default value: 1.
--metric.wearout.max-ratio
- Percentage of worn-out CS disks allowed in the cluster. Default value: 30.
--checker.failed_cs.enable
- Enable the CS failure checker. Default value: true.
--checker.scsi_failures.enable
- Enable the SCSI failure checker. Default value: true.
--checker.slow_cs.enable
- Enable the slow CS checker. Default value: true.
--checker.slow_disk.enable
- Enable the slow disk checker. Default value: true.
--checker.smart.enable
- Enable the S.M.A.R.T. disk checker. Default value: true.
The service logs are stored at /var/log/disks-monitor/disks-monitor.log.
Limitations
- The vstorage-disks-monitor service is disabled in clusters deployed on virtual machines.