Disk health analyzers
Analyzers are the core of the disk health calculation. Each analyzer calculates disk health using its own algorithm. The overall disk health is the product of the health values reported by all of the analyzers.
For example:
- According to the S.M.A.R.T. attributes, the disk health is 0.9.
- The slow disk analyzer reports that the disk health is 0.4.
- The slow CS analyzer reports that the disk health is 0.5.
- According to SCSI errors, the disk health is 1.0.
The overall disk health is calculated as 0.9 * 0.4 * 0.5 * 1.0 = 0.18, or 18%.
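As a minimal sketch of the multiplication above (the function name and dictionary keys are hypothetical and not part of the product), the per-analyzer values can be combined like this:

```python
from math import prod

def overall_disk_health(analyzer_health):
    """Multiply per-analyzer health values, each a fraction in 0.0..1.0."""
    return prod(analyzer_health.values())

# Values taken from the example in the text; the keys are illustrative names.
health = overall_disk_health({
    "smart": 0.9,
    "slow_disk": 0.4,
    "slow_cs": 0.5,
    "scsi_errors": 1.0,
})
print(f"{health:.2f}")  # 0.18
```

Because the values are multiplied, a single analyzer reporting 0 brings the overall health to 0 regardless of the other analyzers.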
S.M.A.R.T. attributes
The following table contains the S.M.A.R.T. attributes that affect the health value:
ID | S.M.A.R.T. attribute | Weight | Limit, in percent |
---|---|---|---|
05 | Reallocated Sectors Count | 2 | 70 |
187 | Reported Uncorrectable Errors | 1 | 70 |
188 | Command Timeout | 1 | 20 |
197 | Current Pending Sectors Count | 2 | 70 |
198 | Offline Uncorrectable Sectors Count | 2 | 70 |
233 | Media Wearout Indicator | 1 | 100 |
The disk health is calculated by using the following formula:
Disk health (%) = K * Π (100% - D)
where:
- K is the reduction coefficient. A disk is considered less healthy if it reports more than one type of S.M.A.R.T. error. The coefficient formula is 0.8^({Number of S.M.A.R.T. attributes with errors} - 1). Possible values are 0–1.
- Π is the product of the (100% - D) terms calculated for each critical S.M.A.R.T. attribute.
- 100% is the initial health of the disk.
- D is the minimum of the limit and the attribute value multiplied by its weight. Its formula is min(limit, attribute_value * weight).
  - limit is the limit of each critical S.M.A.R.T. attribute.
  - attribute_value is the current attribute value.
  - weight is the weight of each critical S.M.A.R.T. attribute.
For example:
- Reallocated Sectors Count: attribute value = 30, weight = 2, limit = 70
- Command Timeout: attribute value = 23, weight = 1, limit = 20
- K = 0.8^(2 - 1) = 0.8
The S.M.A.R.T. disk health is calculated as follows: 0.8 * (100% - min(30 * 2, 70)) * (100% - min(23 * 1, 20)) = 0.8 * (100% - 60%) * (100% - 20%) = 0.8 * 0.4 * 0.8 = 0.256 (26%)
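The formula and the worked example above can be sketched in code as follows. This is an illustration only: the names `SMART_RULES` and `smart_health` are hypothetical, and the sketch assumes that an attribute counts as "with error" when its current value is greater than zero, which the text does not state explicitly.

```python
# (weight, limit in percent) per attribute ID, copied from the table above.
SMART_RULES = {
    5: (2, 70),     # Reallocated Sectors Count
    187: (1, 70),   # Reported Uncorrectable Errors
    188: (1, 20),   # Command Timeout
    197: (2, 70),   # Current Pending Sectors Count
    198: (2, 70),   # Offline Uncorrectable Sectors Count
    233: (1, 100),  # Media Wearout Indicator
}

def smart_health(attribute_values):
    """Return the S.M.A.R.T. health as a fraction in 0.0..1.0.

    attribute_values maps an attribute ID to its current value.
    Assumption: an attribute has an error when its value is > 0.
    """
    product = 1.0
    attributes_with_errors = 0
    for attr_id, value in attribute_values.items():
        if attr_id not in SMART_RULES or value <= 0:
            continue
        weight, limit = SMART_RULES[attr_id]
        d = min(limit, value * weight)   # D, in percent
        product *= (100 - d) / 100       # the (100% - D) term
        attributes_with_errors += 1
    # K = 0.8^(n - 1); with no errors there is nothing to reduce.
    k = 0.8 ** (attributes_with_errors - 1) if attributes_with_errors else 1.0
    return k * product

# Reallocated Sectors Count = 30, Command Timeout = 23, as in the example.
print(round(smart_health({5: 30, 188: 23}), 3))  # 0.256
```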
Slow disk and slow CS analyzers
The slow disk and slow CS analyzers calculate disk health from the I/O latency averaged over a 15-minute period.
The following table shows the default thresholds:
Analyzer | OK latency, in seconds | FATAL latency, in seconds |
---|---|---|
Slow CS | 0.03 | 0.3 |
Slow HDD Disk | 0.02 | 0.1 |
Slow SSD Disk | 0.002 | 0.1 |
If the disk latency is less than the OK latency, the disk health is considered to be 100%. If the disk latency exceeds the FATAL latency, the disk health is considered to be 0%. For latency values between these two thresholds, the disk health decreases linearly from 100% to 0%.
When the disk health drops to 0%, the service generates an alert and marks the corresponding CS as slow. Such a CS does not trigger automatic replication but is no longer available for chunk allocation.
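The linear mapping between the OK and FATAL thresholds can be sketched as shown below. The names `THRESHOLDS` and `latency_health` are hypothetical, and the values are the defaults from the table above.

```python
# (OK latency, FATAL latency) in seconds, defaults from the table above.
THRESHOLDS = {
    "slow_cs": (0.03, 0.3),
    "slow_hdd": (0.02, 0.1),
    "slow_ssd": (0.002, 0.1),
}

def latency_health(analyzer, avg_latency):
    """Map an average I/O latency (seconds) to a health fraction in 0.0..1.0."""
    ok, fatal = THRESHOLDS[analyzer]
    if avg_latency <= ok:
        return 1.0   # at or below OK latency: fully healthy
    if avg_latency >= fatal:
        return 0.0   # at or above FATAL latency: health is 0%
    # Linear interpolation between the two thresholds.
    return (fatal - avg_latency) / (fatal - ok)

# An HDD averaging 60 ms sits halfway between 20 ms and 100 ms.
print(round(latency_health("slow_hdd", 0.06), 2))  # 0.5
```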
SCSI errors
By default, each SCSI failure decreases the overall disk health by 4%. The maximum health impact is set to 70%.
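One possible reading of this rule is sketched below. It assumes that the impact accumulates additively (4 percentage points per failure) and is capped at 70 percentage points, so the health reported by this analyzer never falls below 30%. The text does not confirm whether the decrease is additive or multiplicative, and the function and parameter names are hypothetical.

```python
def scsi_health(failure_count, per_failure=4, max_impact=70):
    """Return the SCSI-error health as a fraction in 0.0..1.0.

    Assumption: each failure subtracts `per_failure` percentage points,
    and the total reduction is capped at `max_impact` percentage points.
    """
    impact = min(max_impact, failure_count * per_failure)
    return (100 - impact) / 100

print(scsi_health(5))    # 0.8
print(scsi_health(100))  # 0.3
```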