Troubleshooting node disks
The S.M.A.R.T. status of all disks is monitored by the smartctl tool installed along with Virtuozzo Hybrid Infrastructure. Run every 10 minutes, the tool polls all disks attached to nodes, including journaling SSDs and system disks, and reports the results to the management node. The tool checks disk health according to S.M.A.R.T. attributes. If it finds a disk in a pre-failure condition, it generates an alert. A pre-failure condition means that at least one of these S.M.A.R.T. attributes is not zero:
- Reallocated Sector Count
- Reallocated Event Count
- Current Pending Sector Count
- Uncorrectable Sector Count
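The pre-failure rule above can be illustrated with a minimal Python sketch. It assumes the attribute table printed by smartctl -A for an ATA disk, and it assumes that the four counters appear under the smartmontools names Reallocated_Sector_Ct, Reallocated_Event_Count, Current_Pending_Sector, and Offline_Uncorrectable. The function name and the sample rows are hypothetical and serve only as an illustration; this is not the product's own implementation.

```python
from typing import Dict

# Attribute names as printed by `smartctl -A` for ATA disks. Mapping them to
# the four counters listed above is an assumption of this sketch.
PRE_FAILURE_ATTRIBUTES = (
    "Reallocated_Sector_Ct",
    "Reallocated_Event_Count",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
)


def nonzero_pre_failure_attributes(smartctl_output: str) -> Dict[str, int]:
    """Return the watched attributes whose raw value is not zero."""
    found = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns: ID#, name, ..., RAW_VALUE.
        if len(fields) < 10 or fields[1] not in PRE_FAILURE_ATTRIBUTES:
            continue
        raw = fields[9]
        if raw.isdigit() and int(raw) != 0:
            found[fields[1]] = int(raw)
    return found


# Made-up sample rows in the smartctl attribute table layout.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       8
  9 Power_On_Hours          0x0032   097   097   000    Old_age   Always       -       13405
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
"""

if __name__ == "__main__":
    # A non-empty result means the disk matches the pre-failure condition.
    print(nonzero_pre_failure_attributes(SAMPLE))
```

A non-empty dictionary corresponds to the pre-failure condition; an empty one means that all four counters are zero or absent from the output.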
Additionally, slow disk and CS analyzers calculate disk health according to the average I/O latency over time. When the disk I/O latency reaches the predefined threshold, the disk health is considered to be 0%. In this case, an alert is generated and the disk is marked as unresponsive.
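As a rough illustration of the latency rule, the following Python sketch averages a window of latency samples and flags the disk once the average reaches a threshold. The threshold value, the window contents, and the function name are hypothetical: this section does not state the actual threshold or how the analyzers average latency, so treat the code as a sketch of the idea only.

```python
from statistics import mean
from typing import Sequence

# Hypothetical threshold for illustration only; the product's actual value
# and averaging window are not stated in this section.
LATENCY_THRESHOLD_MS = 1000.0


def is_unresponsive(latencies_ms: Sequence[float],
                    threshold_ms: float = LATENCY_THRESHOLD_MS) -> bool:
    """Return True when the average latency of the window reaches the threshold."""
    if not latencies_ms:
        return False  # no samples, nothing to judge
    return mean(latencies_ms) >= threshold_ms


if __name__ == "__main__":
    print(is_unresponsive([1200.0, 900.0, 1100.0]))  # average is above 1000 ms
    print(is_unresponsive([5.0, 7.0, 6.0]))          # average is far below
```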
Limitations
- For the S.M.A.R.T. tool to work, the S.M.A.R.T. functionality must be enabled in the node’s BIOS.
Prerequisites
- A clear understanding of how disk health is calculated, described in Calculating disk health.
To troubleshoot an unresponsive disk
- Go to the Infrastructure > Nodes screen and click the name of a node that hosts an unresponsive storage disk.
- On the Disks tab, click the storage disk, and then go to the Service tab to view the warning message.
- Check the disk connectivity, S.M.A.R.T. status, and dmesg output on the node.
After fixing the issue, click Mark as healthy to change the disk status to Healthy. If the problem persists, however, it is recommended to replace the disk before it fails. If you recover such a disk, it may decrease cluster performance and increase I/O latency.
To troubleshoot a failed disk
Admin panel
- Go to the Infrastructure > Nodes screen and click the name of a node that hosts a failed service.
- On the Disks tab, click the failed disk, and then go to the Service tab to view the error message.
- Click Get diagnostic information to investigate the smartctl and dmesg outputs.
Command-line interface
- Find out the device name of a failed disk on the node from the vinfra node disk list output:

    # vinfra node disk list --node node003
    +-------------+---------+------+--------+-------------+----------+----------+---------------+------------+----------------+
    | id          | device  | type | role   | disk_status | used     | size     | physical_size | service_id | service_status |
    +-------------+---------+------+--------+-------------+----------+----------+---------------+------------+----------------+
    | 36972905<…> | nvme1n1 | ssd  | cs     | ok          | 1.5TiB   | 1.8TiB   | 1.8TiB        | 1090       | failed         |
    | B9F2C34F<…> | nvme0n1 | ssd  | cs     | ok          | 1.5TiB   | 1.8TiB   | 1.8TiB        | 1091       | active         |
    | A8E05CCA<…> | nvme2n1 | ssd  | cs     | ok          | 1.5TiB   | 1.8TiB   | 1.8TiB        | 1086       | active         |
    | D6E421E0<…> | nvme3n1 | ssd  | cs     | ok          | 1.5TiB   | 1.8TiB   | 1.8TiB        | 1087       | active         |
    | md126       | md126   | ssd  | system | ok          | 364.2MiB | 989.9MiB | 1022.0MiB     |            |                |
    | md127       | md127   | ssd  | system | ok          | 104.4GiB | 187.1GiB | 190.2GiB      |            |                |
    +-------------+---------+------+--------+-------------+----------+----------+---------------+------------+----------------+

  On the node node003, the storage disk nvme1n1 is reported as failed.
- Investigate the smartctl and dmesg outputs for the failed disk. For example:

    # vinfra node disk show diagnostic-info --node node003 nvme1n1 -f yaml
    - command: smartctl --all /dev/nvme1n1
      stdout: 'smartctl 7.1 2020-06-20 r5066 [x86_64-linux-3.10.0-1160.41.1.vz7.183.5] (local build)
        Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org
        === START OF INFORMATION SECTION ===
        Model Number: INTEL SSDPE2KX020T8
        Serial Number: PHLJ9500032H2P0BGN
        Firmware Version: VDV10131
        PCI Vendor/Subsystem ID: 0x8086
        IEEE OUI Identifier: 0x5cd2e4
        Total NVM Capacity: 2,000,398,934,016 [2.00 TB]
        Unallocated NVM Capacity: 0
        Controller ID: 0
        Number of Namespaces: 1
        Namespace 1 Size/Capacity: 2,000,398,934,016 [2.00 TB]
        Namespace 1 Formatted LBA Size: 512
        Namespace 1 IEEE EUI-64: 5cd2e4 75b0070100
        Local Time is: Fri Nov 26 13:32:44 2021 EET
        Firmware Updates (0x02): 1 Slot
        Optional Admin Commands (0x000e): Format Frmw_DL NS_Mngmt
        Optional NVM Commands (0x0006): Wr_Unc DS_Mngmt
        Maximum Data Transfer Size: 32 Pages
        Warning Comp. Temp. Threshold: 70 Celsius
        Critical Comp. Temp. Threshold: 80 Celsius
        Supported Power States
        St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
        0 + 25.00W - - 0 0 0 0 0 0
        Supported LBA Sizes (NSID 0x1)
        Id Fmt Data Metadt Rel_Perf
        0 + 512 0 2
        1 - 4096 0 0
        === START OF SMART DATA SECTION ===
        SMART overall-health self-assessment test result: PASSED
        SMART/Health Information (NVMe Log 0x02)
        Critical Warning: 0x00
        Temperature: 34 Celsius
        Available Spare: 99%
        Available Spare Threshold: 10%
        Percentage Used: 5%
        Data Units Read: 550,835,698 [282 TB]
        Data Units Written: 720,479,182 [368 TB]
        Host Read Commands: 10,050,305,459
        Host Write Commands: 20,760,365,218
        Controller Busy Time: 1,968
        Power Cycles: 20
        Power On Hours: 13,405
        Unsafe Shutdowns: 16
        Media and Data Integrity Errors: 0
        Error Information Log Entries: 0
        Warning Comp. Temperature Time: 0
        Critical Comp. Temperature Time: 0
        Error Information (NVMe Log 0x01, max 64 entries)
        No Errors Logged
        '
    - command: dmesg --ctime --kernel --level=emerg,alert,crit,err,warn --facility=kern | grep 'nvme1n1'
      stdout: ''
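When reviewing the smartctl text from the diagnostic output for an NVMe disk, a few fields of the SMART/Health section are natural starting points. The Python sketch below reads the Critical Warning, Media and Data Integrity Errors, Available Spare, and Available Spare Threshold fields, which all appear in the example output. The choice of these fields, the helper names, and the wording of the findings are assumptions made for illustration and are not how the product itself evaluates disks.

```python
import re
from typing import List


def _field(text: str, name: str) -> str:
    """Return the value printed after 'name:' or raise if the field is absent."""
    pattern = rf"^[ \t]*{re.escape(name)}:[ \t]*(.+?)[ \t]*$"
    match = re.search(pattern, text, re.MULTILINE)
    if match is None:
        raise ValueError(f"field not found: {name}")
    return match.group(1)


def nvme_health_findings(smartctl_output: str) -> List[str]:
    """List suspicious values in the SMART/Health section (illustrative checks)."""
    findings = []
    # Critical Warning is printed as a hexadecimal bit field, for example 0x00.
    if int(_field(smartctl_output, "Critical Warning"), 16) != 0:
        findings.append("critical warning flag is set")
    errors = _field(smartctl_output, "Media and Data Integrity Errors")
    if int(errors.replace(",", "")) != 0:
        findings.append("media and data integrity errors reported")
    spare = int(_field(smartctl_output, "Available Spare").rstrip("%"))
    threshold = int(_field(smartctl_output, "Available Spare Threshold").rstrip("%"))
    if spare <= threshold:
        findings.append("available spare is at or below its threshold")
    return findings


# Field values copied from the example output above.
SAMPLE = """\
Critical Warning: 0x00
Temperature: 34 Celsius
Available Spare: 99%
Available Spare Threshold: 10%
Percentage Used: 5%
Media and Data Integrity Errors: 0
"""

if __name__ == "__main__":
    print(nvme_health_findings(SAMPLE))
```

For the example output, the function returns an empty list, which matches the PASSED self-assessment result and the empty dmesg output. In that case, the cause of the failed service is likely to lie outside the S.M.A.R.T. data.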
If you cannot fix the problem, contact the technical support team, as described in Getting technical support.