Disk health analyzers

Analyzers are the core of the disk health calculation. Each analyzer calculates disk health by using its own algorithm. The overall disk health is the product of the disk health values from all of the analyzers.

For example:

According to the S.M.A.R.T. attributes, the disk health is 0.9.
The slow disk analyzer reports that the disk health is 0.4.
The slow CS analyzer reports that the disk health is 0.5.
According to SCSI errors, the disk health is 1.0.

The overall disk health is calculated as 0.9*0.4*0.5*1.0 and equals 0.18 or 18%.
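The multiplication above can be sketched in a few lines of Python. This is only an illustration: the function name and the dictionary keys are hypothetical and are not part of the product.

```python
from math import prod

def overall_disk_health(analyzer_health):
    """Multiply per-analyzer health values, each in the range 0.0-1.0."""
    return prod(analyzer_health.values())

# Values from the example above; the key names are illustrative only.
health = overall_disk_health({
    "smart": 0.9,
    "slow_disk": 0.4,
    "slow_cs": 0.5,
    "scsi_errors": 1.0,
})
print(f"{health:.0%}")  # 18%
```

Because the values are multiplied, a single analyzer that reports 0 brings the overall health to 0, regardless of the others.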

S.M.A.R.T. attributes

The following table contains the S.M.A.R.T. attributes that affect the health value:

ID	S.M.A.R.T. attribute	Weight	Limit, in percent
05	Reallocated Sectors Count	2	70
187	Reported Uncorrectable Errors	1	70
188	Command Timeout	1	20
197	Current Pending Sectors Count	2	70
198	Offline Uncorrectable Sectors Count	2	70
202	Percentage Lifetime Remaining	1	100
231	SSD Life Left	1	100
233	Media Wearout Indicator	1	100

The disk health is calculated by using the following formula:

Disk health (%) = K * P (100% - D)

where:

K is the reduction coefficient. A disk is considered less healthy if it reports more than one type of S.M.A.R.T. error. The coefficient formula is 0.8^({Number of S.M.A.R.T. attributes with error} - 1). Possible values are from 0 to 1.
P is the product of the (100% - D) values calculated for each critical S.M.A.R.T. attribute.
100% is the initial health of the disk.
D is the health reduction for a single attribute: the smaller of the attribute's limit and its weighted value. Its formula is min(limit, attribute_value * weight).
limit is the limit of the critical S.M.A.R.T. attribute, that is, the largest value that D can take for it.
attribute_value is the current value of the attribute.
weight is the weight of the critical S.M.A.R.T. attribute.

For example:

Reallocated Sectors Count: attribute value = 30, weight = 2, limit = 70
Command Timeout: attribute value = 23, weight = 1, limit = 20

K = 0.8^(2 - 1) = 0.8

The S.M.A.R.T. disk health is calculated as follows: 0.8 * (100% - min(30*2, 70)) * (100% - min(23*1, 20)) = 0.8 * 40% * 80% = 0.8 * 0.4 * 0.8 = 0.256 (about 26%)
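The formula and the example above can be reproduced with a minimal Python sketch. The weights and limits are copied from the table. The function name is hypothetical, and two details are assumptions rather than documented behavior: an attribute counts as "with error" when its value is nonzero, and K stays at 1 when no attribute reports an error. How raw attribute values are read and normalized (in particular for the wearout attributes 202, 231, and 233) is not covered here.

```python
from math import prod

# Attribute ID -> (weight, limit in percent), copied from the table above.
SMART_RULES = {
    5: (2, 70),
    187: (1, 70),
    188: (1, 20),
    197: (2, 70),
    198: (2, 70),
    202: (1, 100),
    231: (1, 100),
    233: (1, 100),
}

def smart_health(values):
    """Return S.M.A.R.T. disk health (0.0-1.0) for {attribute_id: value}."""
    factors = []
    errors = 0
    for attr_id, value in values.items():
        weight, limit = SMART_RULES[attr_id]
        d = min(limit, value * weight)   # health reduction D, in percent
        factors.append((100 - d) / 100)  # (100% - D) as a fraction
        if value > 0:                    # assumption: nonzero value means an error
            errors += 1
    k = 0.8 ** max(errors - 1, 0)        # assumption: K is 1 when nothing is in error
    return k * prod(factors)

# Reallocated Sectors Count = 30, Command Timeout = 23
print(round(smart_health({5: 30, 188: 23}), 3))  # 0.256
```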

S.M.A.R.T. wearout threshold

SSD vendors report disk health metrics through different S.M.A.R.T. attributes. Currently, attributes 202, 231, and 233 are supported. If multiple wearout metrics are reported simultaneously, the S.M.A.R.T. 233 value takes precedence.

When the S.M.A.R.T. wearout threshold is reached, all CSes on the affected disk enter maintenance mode.

Slow disk and slow CS analyzers

The slow disk and chunk service (CS) analyzers monitor disk I/O latency over 15-minute intervals and evaluate disk and service health to detect performance degradation.

Disk health is calculated as a percentage and is determined using two latency thresholds:

OK latency: The maximum latency (in seconds) at which disk health is considered 100%.
FATAL latency: The latency (in seconds) at which disk health is considered 0%.

The following table shows the default thresholds:

Analyzer	OK latency, in seconds	FATAL latency, in seconds
Slow CS	0.1	60
Slow HDD Disk	0.1	30
Slow SSD Disk	0.01	10

If disk latency is less than OK latency, the disk health is considered to be 100%. If disk latency exceeds FATAL latency, the disk health is considered to be 0%. For latency values between the two thresholds, the disk health decreases linearly from 100% to 0%.
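The mapping from latency to health described above is a linear interpolation between the two thresholds. The following Python sketch illustrates it with the default values from the table. The function and dictionary names are hypothetical, and the treatment of latency exactly equal to a threshold is an assumption.

```python
# Analyzer -> (OK latency, FATAL latency), in seconds, from the table above.
DEFAULT_THRESHOLDS = {
    "slow_cs": (0.1, 60.0),
    "slow_hdd": (0.1, 30.0),
    "slow_ssd": (0.01, 10.0),
}

def latency_health(latency, ok, fatal):
    """Map a latency in seconds to a health value in the range 0.0-1.0."""
    if latency <= ok:
        return 1.0
    if latency >= fatal:
        return 0.0
    # Linear decrease from 1.0 at the OK latency to 0.0 at the FATAL latency.
    return (fatal - latency) / (fatal - ok)

ok, fatal = DEFAULT_THRESHOLDS["slow_hdd"]
print(latency_health(0.05, ok, fatal))  # 1.0
print(latency_health(45.0, ok, fatal))  # 0.0
```

For example, with the slow HDD defaults, a latency halfway between 0.1 and 30 seconds (15.05 seconds) corresponds to a health of about 50%.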

In addition to absolute latency thresholds, the slow CS analyzer performs latency deviation checks. These checks compare the I/O latency of a chunk service against the average latency of other chunk services within the same storage tier.

The table below lists the default deviation parameters:

Parameter	Description	Default value
`threshold`	Deviation factor applied to the average tier latency	10
`reqlimit`	Minimum chunk service request rate required to activate the check	20
`disabled`	Whether the deviation check is disabled globally. Disabling the latency deviation check is not recommended because it suppresses critical alerts and may delay the detection of real hardware or performance issues.	false
Tiers	Optional per-tier overrides	—

When disk health becomes 0% or the latency deviation threshold is met or exceeded, the service generates an alert and marks this CS as unresponsive. Such a CS does not trigger automatic replication but is no longer available for chunk allocation.
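One possible reading of the latency deviation check is sketched below in Python. It is an illustration under assumptions, not the actual implementation: the function name is hypothetical, the "average latency" is taken as a plain arithmetic mean of the other chunk services in the tier, and the check is assumed to fire when the CS latency is at least `threshold` times that average while the CS request rate is at least `reqlimit`. Per-tier overrides are not modeled.

```python
def latency_deviates(cs_latency, other_latencies, request_rate,
                     threshold=10, reqlimit=20, disabled=False):
    """Return True if a CS latency deviates too far from its tier average."""
    # The check is skipped when disabled, when the CS is too idle to judge,
    # or when there are no other chunk services to compare against.
    if disabled or request_rate < reqlimit or not other_latencies:
        return False
    tier_average = sum(other_latencies) / len(other_latencies)
    # Assumption: "met or exceeded" means latency >= threshold * average.
    return cs_latency >= threshold * tier_average

others = [0.04, 0.05, 0.06]  # latencies of other CSes in the tier, in seconds
print(latency_deviates(1.0, others, request_rate=50))  # True
print(latency_deviates(0.2, others, request_rate=50))  # False
print(latency_deviates(1.0, others, request_rate=5))   # False
```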

SCSI errors

By default, each SCSI failure decreases the overall disk health by 4%. The maximum health impact is set to 70%.
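A minimal Python sketch of this rule follows. It assumes that the 4% penalty is subtracted from 100% once per failure (rather than compounded) and that the maximum impact of 70 means the health never drops below 30% because of SCSI errors. The function and parameter names are hypothetical.

```python
def scsi_health(failures, penalty=4, max_impact=70):
    """Return the SCSI-error health (0.0-1.0) for a given number of failures."""
    # Assumption: penalties add up linearly and are capped at max_impact percent.
    impact = min(failures * penalty, max_impact)
    return (100 - impact) / 100

print(scsi_health(0))    # 1.0
print(scsi_health(5))    # 0.8
print(scsi_health(100))  # 0.3
```

Under these assumptions, the cap is reached at 18 failures (18 * 4 = 72, limited to 70).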