Troubleshooting node disks

The vstorage-disks-monitor service monitors node disks, detecting ill (unresponsive) chunk services (CSes), disk wearout, and pre-failure conditions.

Health analysis
Disk health is assessed based on the collected metric values and their weights. If the calculated health score is below zero, the disk is marked as unresponsive.
Wearout detection
S.M.A.R.T. attributes related to disk wear help identify disks nearing the end of their lifespan. Once the wearout threshold is reached, all CSes on the affected disk enter maintenance. Replace the disk before it fails.
Pre-failure detection
Other S.M.A.R.T. attributes help identify disks in a pre-failure state. To prevent data loss, replace such disks proactively.
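
The health analysis described above can be pictured as a weighted score. The following Python sketch is only an illustration of that idea and is not the algorithm used by vstorage-disks-monitor: the metric names, weights, baseline, and formula are all hypothetical. The only behavior taken from this page is that a score below zero marks the disk as unresponsive.

```python
# Illustrative only: metric names, weights, and the baseline are assumptions,
# not values used by the vstorage-disks-monitor service.
HYPOTHETICAL_WEIGHTS = {"io_timeouts": 30, "slow_requests": 5}
HYPOTHETICAL_BASELINE = 100


def health_score(metrics, weights=HYPOTHETICAL_WEIGHTS, baseline=HYPOTHETICAL_BASELINE):
    """Return the baseline minus the weighted sum of the collected metric values."""
    penalty = sum(weights[name] * value for name, value in metrics.items())
    return baseline - penalty


def is_unresponsive(score):
    """A disk is treated as unresponsive only when its score is below zero."""
    return score < 0


if __name__ == "__main__":
    score = health_score({"io_timeouts": 1, "slow_requests": 2})
    print(score, is_unresponsive(score))  # prints: 60 False
```

With these assumed weights, three I/O timeouts and four slow requests give a score of -10, so the disk would be treated as unresponsive. A score of exactly zero would not.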

Prerequisites

To troubleshoot an unresponsive disk

  1. Go to the Infrastructure > Nodes screen and click the name of a node that hosts an unresponsive storage disk.
  2. On the Disks tab, click the storage disk, and then go to the Service tab to view the warning message.
  3. Check the disk connectivity, S.M.A.R.T. status, and dmesg output on the node.
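
The S.M.A.R.T. and dmesg checks in step 3 can also be run from a shell on the node. The following is a minimal Python sketch, not part of the product, that calls smartctl and dmesg with the same options that appear in the diagnostic output later in this section. The function names are hypothetical. Running collect() assumes root privileges and that smartctl is installed on the node.

```python
import subprocess

# Kernel message levels, as used by the dmesg command shown on this page.
DMESG_LEVELS = "emerg,alert,crit,err,warn"


def smartctl_command(device):
    """Build the smartctl invocation for a device name such as 'nvme1n1'."""
    return ["smartctl", "--all", f"/dev/{device}"]


def dmesg_command():
    """Build the dmesg invocation limited to kernel warnings and errors."""
    return ["dmesg", "--ctime", "--kernel", f"--level={DMESG_LEVELS}", "--facility=kern"]


def filter_lines(text, device):
    """Keep only the lines that mention the device (the equivalent of grep)."""
    return [line for line in text.splitlines() if device in line]


def collect(device):
    """Run both commands and return their output for manual review."""
    # smartctl may exit with a non-zero status even when it prints a report,
    # so the exit code is not treated as a failure here.
    smart = subprocess.run(smartctl_command(device), capture_output=True, text=True, check=False)
    kernel = subprocess.run(dmesg_command(), capture_output=True, text=True, check=False)
    return {"smartctl": smart.stdout, "dmesg": filter_lines(kernel.stdout, device)}
```

An empty dmesg list for the device means the kernel logged no warnings or errors that mention it.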

After fixing the issue, click Mark as healthy to change the disk status to Healthy. If the problem persists, replace the disk before it fails. Recovering such a disk may decrease cluster performance and increase I/O latency.

To troubleshoot a failed disk

Admin panel

  1. Go to the Infrastructure > Nodes screen and click the name of a node that hosts a failed service.
  2. On the Disks tab, click the failed disk, and then go to the Service tab to view the error message.
  3. Click Get diagnostic information to investigate the smartctl and dmesg outputs.

Command-line interface

  1. Find the device name of the failed disk in the vinfra node disk list output:

    # vinfra node disk list --node node003
    +-------------+---------+------+--------+-------------+----------+----------+---------------+------------+----------------+
    | id          | device  | type | role   | disk_status | used     | size     | physical_size | service_id | service_status |
    +-------------+---------+------+--------+-------------+----------+----------+---------------+------------+----------------+
    | 36972905<…> | nvme1n1 | ssd  | cs     | ok          | 1.5TiB   | 1.8TiB   | 1.8TiB        | 1090       | failed         |
    | B9F2C34F<…> | nvme0n1 | ssd  | cs     | ok          | 1.5TiB   | 1.8TiB   | 1.8TiB        | 1091       | active         |
    | A8E05CCA<…> | nvme2n1 | ssd  | cs     | ok          | 1.5TiB   | 1.8TiB   | 1.8TiB        | 1086       | active         |
    | D6E421E0<…> | nvme3n1 | ssd  | cs     | ok          | 1.5TiB   | 1.8TiB   | 1.8TiB        | 1087       | active         |
    | md126       | md126   | ssd  | system | ok          | 364.2MiB | 989.9MiB | 1022.0MiB     |            |                |
    | md127       | md127   | ssd  | system | ok          | 104.4GiB | 187.1GiB | 190.2GiB      |            |                |
    +-------------+---------+------+--------+-------------+----------+----------+---------------+------------+----------------+
    

    On the node node003, the storage disk nvme1n1 is reported as failed.

  2. Investigate the smartctl and dmesg outputs for the failed disk. For example:

    # vinfra node disk show diagnostic-info --node node003 nvme1n1 -f yaml
    - command: smartctl --all /dev/nvme1n1
      stdout: 'smartctl 7.1 2020-06-20 r5066 [x86_64-linux-3.10.0-1160.41.1.vz7.183.5]
        (local build)
      Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org
    
      === START OF INFORMATION SECTION ===
      Model Number:                       INTEL SSDPE2KX020T8
      Serial Number:                      PHLJ9500032H2P0BGN
      Firmware Version:                   VDV10131
      PCI Vendor/Subsystem ID:            0x8086
      IEEE OUI Identifier:                0x5cd2e4
      Total NVM Capacity:                 2,000,398,934,016 [2.00 TB]
      Unallocated NVM Capacity:           0
      Controller ID:                      0
      Number of Namespaces:               1
      Namespace 1 Size/Capacity:          2,000,398,934,016 [2.00 TB]
      Namespace 1 Formatted LBA Size:     512
      Namespace 1 IEEE EUI-64:            5cd2e4 75b0070100
      Local Time is:                      Fri Nov 26 13:32:44 2021 EET
      Firmware Updates (0x02):            1 Slot
      Optional Admin Commands (0x000e):   Format Frmw_DL NS_Mngmt
      Optional NVM Commands (0x0006):     Wr_Unc DS_Mngmt
      Maximum Data Transfer Size:         32 Pages
      Warning  Comp. Temp. Threshold:     70 Celsius
      Critical Comp. Temp. Threshold:     80 Celsius
    
      Supported Power States
      St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
       0 +    25.00W       -        -    0  0  0  0        0       0
    
      Supported LBA Sizes (NSID 0x1)
      Id Fmt  Data  Metadt  Rel_Perf
       0 +     512       0         2
       1 -    4096       0         0
    
      === START OF SMART DATA SECTION ===
      SMART overall-health self-assessment test result: PASSED
    
      SMART/Health Information (NVMe Log 0x02)
      Critical Warning:                   0x00
      Temperature:                        34 Celsius
      Available Spare:                    99%
      Available Spare Threshold:          10%
      Percentage Used:                    5%
      Data Units Read:                    550,835,698 [282 TB]
      Data Units Written:                 720,479,182 [368 TB]
      Host Read Commands:                 10,050,305,459
      Host Write Commands:                20,760,365,218
      Controller Busy Time:               1,968
      Power Cycles:                       20
      Power On Hours:                     13,405
      Unsafe Shutdowns:                   16
      Media and Data Integrity Errors:    0
      Error Information Log Entries:      0
      Warning  Comp. Temperature Time:    0
      Critical Comp. Temperature Time:    0
    
      Error Information (NVMe Log 0x01, max 64 entries)
      No Errors Logged
      '
    - command: dmesg --ctime --kernel --level=emerg,alert,crit,err,warn --facility=kern
        | grep 'nvme1n1'
      stdout: ''
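
The two command-line steps above can be partly automated. The following Python sketch, which is not part of the product, parses the table printed by vinfra node disk list to find devices whose service_status is failed, and extracts a few NVMe health fields from the smartctl text. The function names are hypothetical. The three checks in smart_concerns() are an assumption about which fields are worth a first look, not the criteria used by the storage services.

```python
def parse_table(text):
    """Turn the ASCII table from `vinfra node disk list` into a list of dicts."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("|"):  # skip the +-----+ border lines
            rows.append([cell.strip() for cell in line[1:-1].split("|")])
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


def failed_devices(rows):
    """Return device names whose service_status column is 'failed'."""
    return [row["device"] for row in rows if row.get("service_status") == "failed"]


def parse_smart_fields(text):
    """Collect 'Name: value' lines from smartctl output into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


def _number(value):
    """Convert values such as '99%' or '1,204' to an integer."""
    return int(value.rstrip("%").replace(",", ""))


def smart_concerns(fields):
    """Assumed first-look checks on the NVMe SMART/Health fields."""
    concerns = []
    warning = fields.get("Critical Warning")
    if warning is not None and int(warning, 16) != 0:
        concerns.append("critical warning flag is set")
    spare = fields.get("Available Spare")
    threshold = fields.get("Available Spare Threshold")
    if spare and threshold and _number(spare) <= _number(threshold):
        concerns.append("available spare is at or below its threshold")
    errors = fields.get("Media and Data Integrity Errors")
    if errors and _number(errors) > 0:
        concerns.append("media and data integrity errors were logged")
    return concerns
```

For the sample output above, failed_devices() returns nvme1n1, and smart_concerns() returns an empty list because the critical warning is 0x00, the available spare (99%) is above its threshold (10%), and no media errors are logged. An empty list does not mean the disk is healthy. It only means that these three fields show nothing unusual.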

If you cannot fix the problem, contact the technical support team, as described in Getting technical support.