Disk health analyzers

Analyzers are the core of the disk health calculation. Each analyzer calculates disk health using its own algorithm. The overall disk health is the product of the health values reported by all of the analyzers.

For example:

According to the S.M.A.R.T. attributes, the disk health is 0.9.
The slow disk analyzer reports that the disk health is 0.4.
The slow CS analyzer reports that the disk health is 0.5.
According to SCSI errors, the disk health is 1.0.

The overall disk health is calculated as 0.9*0.4*0.5*1.0 and equals 0.18 or 18%.
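
The multiplication above can be sketched in a few lines of Python. The function name overall_health and the representation of each analyzer result as a fraction between 0 and 1 are assumptions made for illustration; they are not part of the product.

```python
import math

def overall_health(analyzer_values):
    """Multiply per-analyzer health values, each a fraction in the range 0.0-1.0."""
    return math.prod(analyzer_values)

# Values from the example above: S.M.A.R.T., slow disk, slow CS, SCSI errors.
health = overall_health([0.9, 0.4, 0.5, 1.0])
print(f"Overall disk health: {health:.0%}")
```

Because the values are multiplied, a single analyzer that reports 0 brings the overall health to 0, regardless of the other analyzers.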

S.M.A.R.T. attributes

The following table contains the S.M.A.R.T. attributes that affect the health value:

ID	S.M.A.R.T. attribute	Weight	Limit, in percent
05	Reallocated Sectors Count	2	70
187	Reported Uncorrectable Errors	1	70
188	Command Timeout	1	20
197	Current Pending Sectors Count	2	70
198	Offline Uncorrectable Sectors Count	2	70
233	Media Wearout Indicator	1	100

The disk health is calculated by using the following formula:

Disk health (%) = K * П (100% - D)

where:

K is the reduction coefficient. A disk is considered less healthy if it reports more than one type of S.M.A.R.T. error. The coefficient is calculated as 0.8^({Number of S.M.A.R.T. attributes with errors} – 1). Possible values are 0–1.
П is the product of the (100% - D) terms, one for each critical S.M.A.R.T. attribute.
100% is the initial health of the disk.
D is the health deduction for an attribute: the smaller of the limit and the attribute value multiplied by its weight, that is, min(limit, attribute_value * weight).
limit is the limit of each critical S.M.A.R.T. attribute, in percent.
attribute_value is the current attribute value.
weight is the weight of each critical S.M.A.R.T. attribute.

For example:

Reallocated Sectors Count: attribute value = 30, weight = 2, limit = 70
Command Timeout: attribute value = 23, weight = 1, limit = 20

K = 0.8^(2–1) = 0.8

The S.M.A.R.T. disk health is calculated as follows: 0.8 * (100% – min(30*2, 70)) * (100% – min(23*1, 20)) = 0.8 * 0.4 * 0.8 = 0.256 (approximately 26%)
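
The formula can be sketched in Python as follows. This is a minimal illustration, not the actual implementation: the names SMART_RULES and smart_health are hypothetical, an attribute is assumed to count as "with errors" when its deduction D is greater than zero, and K is assumed to stay at 1 when fewer than two attributes report errors (which keeps it in the documented 0–1 range).

```python
# Attribute ID -> (weight, limit in percent), copied from the table above.
SMART_RULES = {
    5: (2, 70),
    187: (1, 70),
    188: (1, 20),
    197: (2, 70),
    198: (2, 70),
    233: (1, 100),
}

def smart_health(attribute_values, reduction_base=0.8):
    """Return the S.M.A.R.T. health as a fraction in the range 0.0-1.0.

    attribute_values maps an attribute ID to its current value.
    """
    health = 1.0
    attributes_with_errors = 0
    for attr_id, value in attribute_values.items():
        weight, limit = SMART_RULES[attr_id]
        deduction = min(limit, value * weight)  # D, in percent
        if deduction > 0:  # assumption: any nonzero deduction counts as an error
            attributes_with_errors += 1
        health *= (100 - deduction) / 100
    # K = 0.8^(attributes with errors - 1); assumed to be 1 for zero or one attribute.
    k = reduction_base ** max(attributes_with_errors - 1, 0)
    return k * health

# Reallocated Sectors Count = 30, Command Timeout = 23 (the example above).
example = smart_health({5: 30, 188: 23})
print(f"S.M.A.R.T. disk health: {example:.0%}")
```

Note that the limit caps how much a single attribute can reduce the health: Command Timeout, with a limit of 20, can never lower the result by more than 20% on its own.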

Slow disk and slow CS analyzers

The slow disk and slow CS analyzers calculate disk health from the I/O latency averaged over a period of 15 minutes.

The following table shows the default thresholds:

Analyzer	OK latency, in seconds	FATAL latency, in seconds
Slow CS	0.1	60
Slow HDD Disk	0.1	30
Slow SSD Disk	0.01	10

If the disk latency is less than the OK latency, the disk health is considered to be 100%. If the disk latency exceeds the FATAL latency, the disk health is considered to be 0%. For a latency between these two thresholds, the disk health decreases linearly from 100% to 0%.
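
The threshold rule can be sketched as a linear interpolation. The names THRESHOLDS and latency_health, as well as the analyzer keys, are hypothetical and chosen only for this illustration; the numbers are the default thresholds from the table above.

```python
# Analyzer -> (OK latency, FATAL latency), in seconds, from the table above.
THRESHOLDS = {
    "slow_cs": (0.1, 60),
    "slow_hdd": (0.1, 30),
    "slow_ssd": (0.01, 10),
}

def latency_health(analyzer, avg_latency):
    """Map an average I/O latency (seconds) to a health fraction in 0.0-1.0."""
    ok, fatal = THRESHOLDS[analyzer]
    if avg_latency <= ok:
        return 1.0
    if avg_latency >= fatal:
        return 0.0
    # Between the thresholds, health falls linearly from 1.0 to 0.0.
    return (fatal - avg_latency) / (fatal - ok)

# An HDD with an average latency halfway between its two thresholds.
print(f"{latency_health('slow_hdd', 15.05):.0%}")
```

For example, an HDD with an average latency of 15.05 seconds sits halfway between 0.1 and 30 seconds, so this sketch reports a health of about 50%.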

When disk health becomes 0%, the service generates an alert and marks this CS as unresponsive. Such a CS does not trigger automatic replication but is no longer available for chunk allocation.

SCSI errors

By default, each SCSI failure decreases the overall disk health by 4%. The maximum health impact is set to 70%.
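
One possible reading of this rule is sketched below. It assumes that the penalties are additive (4 percentage points per failure) and that the maximum impact of 70 is a percentage, so the SCSI analyzer alone never reports less than 30% health. The function name scsi_health and its parameters are hypothetical.

```python
def scsi_health(failure_count, penalty_per_failure=4, max_impact=70):
    """Return the SCSI analyzer health as a fraction in the range 0.0-1.0.

    Assumes an additive penalty in percentage points, capped at max_impact.
    """
    impact = min(max_impact, failure_count * penalty_per_failure)
    return (100 - impact) / 100

for failures in (0, 5, 100):
    print(failures, f"{scsi_health(failures):.0%}")
```

Under these assumptions, the cap is reached at 18 failures, because 17 failures give an impact of 68% and the 18th would exceed 70%.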