Disk-related metrics in Prometheus
The Prometheus service stores the following disk-related metrics:
CS-related metrics

| Metric | Description |
|---|---|
| csd_io_op_time_seconds | Mean time per I/O request |
| master:mdsd_cs_status | CS status on master MDS |

Disk-related metrics in /proc/diskstats

| Metric | Description |
|---|---|
| node_disk_read_time_seconds | Total time, in seconds, spent on read requests |
| node_disk_reads_completed | Total number of completed read requests |
| node_disk_write_time_seconds | Total time, in seconds, spent on write requests |
| node_disk_writes_completed | Total number of completed write requests |

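
The /proc/diskstats counters are cumulative totals, so a mean time per request is normally derived from how much the time counter and the request counter each grew over an interval. The following is a minimal sketch of that arithmetic, assuming two samples of each counter have already been collected for one disk. The function name `mean_latency_seconds` and the sample values are hypothetical and are not part of the product.

```python
from typing import Optional


def mean_latency_seconds(time_start: float, time_end: float,
                         ops_start: float, ops_end: float) -> Optional[float]:
    """Average seconds per completed request between two counter samples."""
    ops = ops_end - ops_start
    busy = time_end - time_start
    if ops <= 0 or busy < 0:
        # No requests completed in the interval, or a counter was reset:
        # there is no meaningful average to report.
        return None
    return busy / ops


# Hypothetical samples for one disk, taken some time apart.
read_latency = mean_latency_seconds(
    time_start=120.0, time_end=123.0,   # node_disk_read_time_seconds
    ops_start=40_000, ops_end=41_500,   # node_disk_reads_completed
)
print(read_latency)  # 0.002
```

The same function applies to writes by passing samples of node_disk_write_time_seconds and node_disk_writes_completed.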
S.M.A.R.T. metrics

| Metric | Description |
|---|---|
| smart_device_smart_healthy | Whether the overall S.M.A.R.T. status of the device is healthy |
| smart_reallocated_sector_ct | Total number of reallocated disk sectors (attribute ID 05) |
| smart_reported_uncorrect | Total number of errors that could not be recovered using hardware ECC (attribute ID 187) |
| smart_command_timeout | Total number of operations aborted due to a timeout (attribute ID 188) |
| smart_current_pending_sector | Total number of unstable sectors (attribute ID 197) |
| smart_offline_uncorrectable | Total number of uncorrectable errors when reading or writing a sector (attribute ID 198) |
| smart_media_wearout_indicator | Media Wearout Indicator for SSDs (attribute ID 233) |
| smart_nvme_intel_wear_leveling | Media Wearout Indicator for Intel NVMe devices (attribute ID 233) |
| smart_scsi_read_errors_uncorrected | Total number of uncorrectable errors when reading a sector |
| smart_scsi_reallocated_sector_ct | Total number of reallocated disk sectors |
| smart_scsi_verify_errors_uncorrected | Total number of uncorrectable errors when verifying a sector |
| smart_scsi_write_errors_uncorrected | Total number of uncorrectable errors when writing a sector |

Kernel SCSI errors

| Metric | Description |
|---|---|
| kernel_scsi_failures_total | Total number of SCSI failures reported by the kernel |

Disk health metric from vstorage-disks-monitor

| Metric | Description |
|---|---|
| diskmon_cs_disk_health | Disk health reported by the vstorage-disks-monitor service. Values range from 0.0 to 1.0, where 1.0 means that the disk is 100% healthy. |

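
Because diskmon_cs_disk_health is a fraction between 0.0 and 1.0, a simple way to act on it is to compare each disk's value against a cutoff. The sketch below illustrates that idea only. The function name `unhealthy_disks`, the disk names, and the 0.8 threshold are all assumptions chosen for the example and are not values recommended by the product.

```python
from typing import Dict, List


def unhealthy_disks(health_by_disk: Dict[str, float],
                    threshold: float = 0.8) -> List[str]:
    """Return the disks whose health value is below the threshold, sorted by name."""
    if not 0.0 <= threshold <= 1.0:
        # The metric is documented as 0.0 to 1.0, so other cutoffs make no sense.
        raise ValueError("threshold must be between 0.0 and 1.0")
    return sorted(disk for disk, health in health_by_disk.items()
                  if health < threshold)


# Hypothetical diskmon_cs_disk_health readings, keyed by disk name.
samples = {"sda": 1.0, "sdb": 0.65, "sdc": 0.9}
flagged = unhealthy_disks(samples)
print(flagged)  # ['sdb']
```

A suitable threshold depends on the environment, so it is left as a parameter rather than fixed in the function.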