Disk-related metrics in Prometheus

CS-related metrics

csd_io_op_time_seconds

Mean time per I/O request

master:mdsd_cs_status

CS status on the master MDS

Disk-related metrics in /proc/diskstats

node_disk_read_time_seconds

Total time, in seconds, spent on read requests

node_disk_reads_completed

Total number of completed read requests

node_disk_write_time_seconds

Total time, in seconds, spent on write requests

node_disk_writes_completed

Total number of completed write requests
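
The time and completion metrics above can be combined into a mean latency per request. The following is a minimal sketch, assuming both metrics are cumulative counters (as the underlying /proc/diskstats fields are) that have been sampled twice for the same disk. The function name and the sample values are hypothetical and serve only as an illustration.

```python
from typing import Optional


def mean_latency_seconds(
    time_before: float,
    time_after: float,
    completed_before: float,
    completed_after: float,
) -> Optional[float]:
    """Return mean seconds per completed request between two counter samples."""
    delta_time = time_after - time_before
    delta_ops = completed_after - completed_before
    # No completed requests, or a counter reset, gives no meaningful average.
    if delta_ops <= 0 or delta_time < 0:
        return None
    return delta_time / delta_ops


# Hypothetical samples of node_disk_read_time_seconds and
# node_disk_reads_completed, taken some time apart for one disk.
read_latency = mean_latency_seconds(
    time_before=120.0,
    time_after=123.0,
    completed_before=40_000,
    completed_after=41_500,
)
print(read_latency)  # 0.002
```

The same calculation applies to node_disk_write_time_seconds and node_disk_writes_completed. Returning None instead of 0 keeps an idle disk distinguishable from a disk with very low latency.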

S.M.A.R.T. metrics

smart_device_smart_healthy

S.M.A.R.T. status is healthy

smart_reallocated_sector_ct

Total number of reallocated disk sectors (05)

smart_reported_uncorrect

Total number of errors that could not be recovered using hardware ECC (187)

smart_command_timeout

Total number of operations aborted due to a timeout (188)

smart_current_pending_sector

Total number of unstable sectors (197)

smart_offline_uncorrectable

Total number of uncorrectable errors when reading or writing a sector (198)

smart_media_wearout_indicator

Media Wearout Indicator for SSD (233)

smart_nvme_intel_wear_leveling

Media Wearout Indicator for Intel NVMe (233)

smart_scsi_read_errors_uncorrected

Total number of uncorrectable errors when reading a sector

smart_scsi_reallocated_sector_ct

Total number of reallocated disk sectors

smart_scsi_verify_errors_uncorrected

Total number of uncorrectable errors when verifying a sector

smart_scsi_write_errors_uncorrected

Total number of uncorrectable errors when writing a sector

Kernel SCSI errors

kernel_scsi_failures_total

Total number of SCSI failures reported by the kernel

Disk health metric from vstorage-disks-monitor

diskmon_cs_disk_health

Disk health reported by the vstorage-disks-monitor service. Values range from 0.0 to 1.0, where 1.0 means that the disk is 100% healthy.
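
As an illustration of how the 0.0 to 1.0 scale might be read, the sketch below converts a diskmon_cs_disk_health value into a percentage and a coarse status. The function name and the 0.9 warning threshold are assumptions made for this example; they are not defined by the vstorage-disks-monitor service.

```python
def describe_disk_health(value: float, warn_below: float = 0.9) -> str:
    """Format a health value in [0.0, 1.0] as a percentage with a coarse status."""
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"health value out of range: {value}")
    # warn_below is a hypothetical threshold chosen for illustration only.
    status = "ok" if value >= warn_below else "degraded"
    return f"{value * 100:.0f}% healthy ({status})"


print(describe_disk_health(1.0))   # 100% healthy (ok)
print(describe_disk_health(0.75))  # 75% healthy (degraded)
```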