Disk health analyzers
Analyzers are the core of the disk health calculation. Each analyzer calculates disk health using its own algorithm. The overall disk health is the product of the health values reported by all analyzers.
For example:
- According to the S.M.A.R.T. attributes, the disk health is 0.9.
- The slow disk analyzer reports that the disk health is 0.4.
- The slow CS analyzer reports that the disk health is 0.5.
- According to SCSI errors, the disk health is 1.0.
The overall disk health is calculated as 0.9*0.4*0.5*1.0 and equals 0.18 or 18%.
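The multiplication above can be sketched in a few lines of Python. This is only an illustration of the arithmetic. The function name `overall_health` and the dictionary keys are hypothetical and do not correspond to any product API.

```python
from math import prod


def overall_health(analyzer_health):
    """Return the product of per-analyzer health values (each in 0.0-1.0).

    `analyzer_health` maps a hypothetical analyzer name to its health value.
    """
    return prod(analyzer_health.values())


# Values taken from the example above; the key names are illustrative only.
scores = {"smart": 0.9, "slow_disk": 0.4, "slow_cs": 0.5, "scsi": 1.0}
print(f"Overall disk health: {overall_health(scores):.0%}")
```

Because the values are multiplied, a single analyzer reporting a low value lowers the overall health no matter what the other analyzers report.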
S.M.A.R.T. attributes
The following table contains the S.M.A.R.T. attributes that affect the health value:
| ID | S.M.A.R.T. attribute | Weight | Limit, in percent |
|---|---|---|---|
| 05 | Reallocated Sectors Count | 2 | 70 |
| 187 | Reported Uncorrectable Errors | 1 | 70 |
| 188 | Command Timeout | 1 | 20 |
| 197 | Current Pending Sectors Count | 2 | 70 |
| 198 | Offline Uncorrectable Sectors Count | 2 | 70 |
| 202 | Percentage Lifetime Remaining | 1 | 100 |
| 231 | SSD Life Left | 1 | 100 |
| 233 | Media Wearout Indicator | 1 | 100 |
The disk health is calculated by using the following formula:
Disk health (%) = K * P(100% - D)
where:
- K is the reduction coefficient. A disk is considered less healthy if it reports more than one type of S.M.A.R.T. error. The coefficient formula is 0.8^({Number of S.M.A.R.T. attributes with error} - 1). Possible values are 0–1.
- P denotes the product of the (100% - D) terms, one term for each critical S.M.A.R.T. attribute.
- 100% is the initial health of the disk.
- D is the minimum of the limit and the weighted attribute value. Its formula is min(limit, attribute_value * weight), where:
  - limit is the limit of the critical S.M.A.R.T. attribute.
  - attribute_value is the current attribute value.
  - weight is the weight of the critical S.M.A.R.T. attribute.
For example:
- Reallocated Sectors Count: attribute value = 30, weight = 2, limit = 70
- Command Timeout: attribute value = 23, weight = 1, limit = 20
- K = 0.8^(2–1) = 0.8
The S.M.A.R.T. disk health is calculated as follows: 0.8 * (100% – min(30*2, 70)) * (100% – min(23*1, 20)) = 0.8 * 40% * 80% = 0.8 * 0.4 * 0.8 = 0.256 (26%)
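The S.M.A.R.T. calculation can be reproduced with a short Python sketch. The function name `smart_health` and the tuple layout are hypothetical. The sketch assumes the caller passes only the attributes that currently report errors, and that a disk with no such attributes keeps its initial 100% health. How the analyzer itself decides that an attribute "has an error" is not covered here.

```python
def smart_health(failing_attributes, k_base=0.8):
    """Return S.M.A.R.T. health as a fraction in 0.0-1.0.

    `failing_attributes` is a list of (attribute_value, weight, limit)
    tuples, one per S.M.A.R.T. attribute that reports errors.
    """
    if not failing_attributes:
        # Assumption: no failing attributes means the initial 100% health.
        return 1.0

    # K: reduction coefficient, 0.8^(number of failing attributes - 1).
    k = k_base ** (len(failing_attributes) - 1)

    # P: product of (100% - D) over the failing attributes.
    p = 1.0
    for attribute_value, weight, limit in failing_attributes:
        d = min(limit, attribute_value * weight)  # D, in percent
        p *= (100 - d) / 100

    return k * p


example = [
    (30, 2, 70),  # Reallocated Sectors Count
    (23, 1, 20),  # Command Timeout
]
print(f"S.M.A.R.T. health: {smart_health(example):.3f}")
```

The limit caps how much one attribute can subtract. In the example, Command Timeout contributes 20% even though its weighted value is 23.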
S.M.A.R.T. wearout threshold
Different SSD vendors report disk health metrics using different S.M.A.R.T. attributes. Currently, attributes 202, 231, and 233 are supported. If multiple wearout metrics are reported simultaneously, the S.M.A.R.T. 233 value takes precedence.
When the S.M.A.R.T. wearout threshold is reached, all CSes on the affected disk enter maintenance mode.
Slow disk and slow CS analyzers
The slow disk and chunk service (CS) analyzers monitor disk I/O latency over 15-minute intervals and evaluate disk and service health to detect performance degradation.
Disk health is calculated as a percentage and is determined using two latency thresholds:
- OK latency: The maximum latency (in seconds) at which disk health is considered 100%.
- FATAL latency: The latency (in seconds) at which disk health is considered 0%.
The following table shows the default thresholds:
| Analyzer | OK latency, in seconds | FATAL latency, in seconds |
|---|---|---|
| Slow CS | 0.1 | 60 |
| Slow HDD Disk | 0.1 | 30 |
| Slow SSD Disk | 0.01 | 10 |
If disk latency does not exceed OK latency, the disk health is considered to be 100%. If disk latency reaches or exceeds FATAL latency, the disk health is considered to be 0%. For latency values between the two thresholds, disk health decreases linearly from 100% to 0%.
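The threshold rule amounts to a clamped linear interpolation, sketched below in Python. The function name `latency_health` and the `DEFAULT_THRESHOLDS` dictionary are hypothetical. The threshold values themselves come from the table above.

```python
# (OK latency, FATAL latency) in seconds, from the default thresholds table.
DEFAULT_THRESHOLDS = {
    "slow_cs": (0.1, 60),
    "slow_hdd": (0.1, 30),
    "slow_ssd": (0.01, 10),
}


def latency_health(latency, ok, fatal):
    """Map a latency in seconds to a health fraction in 0.0-1.0."""
    if latency <= ok:
        return 1.0
    if latency >= fatal:
        return 0.0
    # Linear decrease from 1.0 at `ok` down to 0.0 at `fatal`.
    return (fatal - latency) / (fatal - ok)


ok, fatal = DEFAULT_THRESHOLDS["slow_ssd"]
for latency in (0.005, 5.005, 12):
    print(f"{latency} s -> {latency_health(latency, ok, fatal):.0%}")
```

With the SSD defaults, a latency halfway between the two thresholds (5.005 seconds) yields a health of 50%.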
In addition to absolute latency thresholds, the slow CS analyzer performs latency deviation checks. These checks compare the I/O latency of a chunk service against the average latency of other chunk services within the same storage tier.
The table below lists the default deviation parameters:
| Parameter | Description | Default value |
|---|---|---|
| threshold | Deviation factor applied to the average tier latency | 10 |
| reqlimit | Minimum chunk service request rate required to activate the check | 20 |
| disabled | Whether the deviation check is disabled globally. Disabling the latency deviation check is not recommended, as it suppresses critical alerts and may delay detection of real hardware or performance issues. | false |
| Tiers | Optional per-tier overrides | — |
When disk health becomes 0% or the latency deviation threshold is met or exceeded, the service generates an alert and marks this CS as unresponsive. Such a CS does not trigger automatic replication but is no longer available for chunk allocation.
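The deviation check can be approximated with the Python sketch below. It is a simplified model, not the service's actual logic. The function name `latency_deviates` is hypothetical, the average is assumed to be an arithmetic mean over the other chunk services in the tier, and the request rate is assumed to use the same unit as `reqlimit`. Per-tier overrides are not modeled.

```python
from statistics import mean


def latency_deviates(cs_latency, cs_request_rate, peer_latencies,
                     threshold=10, reqlimit=20, disabled=False):
    """Return True if a chunk service fails the latency deviation check.

    `peer_latencies` holds the latencies of the other chunk services
    in the same storage tier.
    """
    if disabled:
        return False
    if cs_request_rate < reqlimit:
        # Too few requests: the check is not activated.
        return False
    if not peer_latencies:
        # Assumption: with no peers in the tier there is nothing to compare.
        return False
    # The deviation threshold is "met or exceeded".
    return cs_latency >= threshold * mean(peer_latencies)


peers = [0.02, 0.04]  # average tier latency: 0.03 s
print(latency_deviates(0.5, 50, peers))  # latency above 10 x 0.03 s
print(latency_deviates(0.5, 5, peers))   # request rate below reqlimit
```

The `reqlimit` guard keeps a nearly idle chunk service from being flagged on the basis of a few slow requests.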
SCSI errors
By default, each SCSI failure decreases the overall disk health by 4%. The maximum health impact is set to 70%.
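A minimal Python sketch of this rule follows. It assumes that each failure subtracts 4 percentage points from an initial 100% and that the cap of 70 is also expressed in percentage points. Neither assumption is confirmed by the text above, and the function name `scsi_health` is hypothetical.

```python
def scsi_health(failure_count, per_failure=4, max_impact=70):
    """Return SCSI-error-based health as a fraction in 0.0-1.0.

    Assumes a subtractive model: health = 100% - min(cap, failures * 4%).
    """
    impact = min(max_impact, failure_count * per_failure)
    return (100 - impact) / 100


for failures in (0, 5, 100):
    print(f"{failures} failures -> {scsi_health(failures):.0%}")
```

Under these assumptions, SCSI errors alone never bring this analyzer's health below 30%, which happens from the 18th failure onward.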