68046: Spotting and fixing hardware I/O errors (disk errors) on Linux

    Last update: 26-03-2021

    Introduction

    This article is an example-driven reference for recognizing when hardware disk I/O errors are causing problems for the backup agent. The error messages shown in the backup console and in the PCS logs vary somewhat from case to case. Hardware disk errors are not only a problem for creating backups: they pose a hidden yet significant danger to the stability and operability of the customer's machine and can easily lead to data loss, so spotting them in time can be crucial.

    Symptoms

    Error messages in the Backup Console:

    Backup fails with "Common I/O error."

    Backup fails with "Cannot read the snapshot of the volume."

    Error messages in the mms and/or pcs logs:

    Example:

    ------------------------
    Error code: 21561347
    Fields: {"$module":"disk_bundle_lxa64_26077"}
    Message: Backup has failed.
    ------------------------
    Error code: 66596
    Fields: {"$module":"disk_bundle_lxa64_26077"}
    Message: Failed to commit operations.
    ------------------------
    Error code: 458755
    Fields: {"$module":"disk_bundle_lxa64_26077"}
    Message: Read error.
    ------------------------
    Error code: 5832708
    Fields: {"$module":"disk_bundle_lxa64_26077","device":"/dev/mapper/pve-root"}
    Message: Cannot read the snapshot of the volume.
    ------------------------
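
    When triaging such an excerpt, it can help to pull out the error codes and any device named in the Fields line, because that device is the one to look for in the kernel logs. The following is a minimal sketch, assuming the log text uses the same dashed separators and "Error code / Fields / Message" layout as the example above. The function names are hypothetical and are not part of the backup agent; real log files may be laid out differently.

```python
import json
import re

# Shortened copy of the example excerpt above.
SAMPLE = """\
------------------------
Error code: 21561347
Fields: {"$module":"disk_bundle_lxa64_26077"}
Message: Backup has failed.
------------------------
Error code: 5832708
Fields: {"$module":"disk_bundle_lxa64_26077","device":"/dev/mapper/pve-root"}
Message: Cannot read the snapshot of the volume.
------------------------
"""


def parse_error_chain(text):
    """Split a dashed-separator excerpt into a list of dicts (hypothetical helper)."""
    records = []
    for block in re.split(r"^[ \t]*-{5,}[ \t]*$", text, flags=re.MULTILINE):
        record = {}
        for line in block.splitlines():
            key, sep, value = line.strip().partition(": ")
            if not sep:
                continue
            if key == "Error code":
                record["code"] = int(value) if value.isdigit() else value
            elif key == "Fields":
                try:
                    record["fields"] = json.loads(value)
                except ValueError:
                    # Keep the raw text if the Fields line is not valid JSON.
                    record["fields"] = {"raw": value}
            elif key == "Message":
                record["message"] = value
        if record:
            records.append(record)
    return records


def affected_devices(records):
    """Return the sorted set of devices named in any Fields entry."""
    return sorted(
        {r["fields"]["device"] for r in records if "device" in r.get("fields", {})}
    )


if __name__ == "__main__":
    chain = parse_error_chain(SAMPLE)
    for rec in chain:
        print(rec.get("code"), rec.get("message"))
    print("Devices to check in the kernel logs:", affected_devices(chain))
```

    The device name found this way (here a device-mapper volume) is only a starting point: the failing physical disk underneath it still has to be identified from the kernel logs.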

    Error messages in the Linux kernel logs (/var/log/messages files, output of the dmesg command):

    Some examples of I/O-related errors are listed below. This list is not exhaustive.

    Keywords to look for

    Disk, input/output and storage subsystem errors vary a lot depending on factors such as the Linux kernel version, the exact type of storage controller and the storage attachment. For example, messages look slightly different when virtual disks are used inside a hypervisor, or when a disk/volume is attached via iSCSI or Fibre Channel. Even so, there are several strings, messages and patterns to look for. This list is not exhaustive:

    • ata x.yz ... DRDY
    • ata x.yz failed command
    • WRITE FPDMA QUEUED
    • READ FPDMA QUEUED
    • print_req_error
    • I/O error ... <the device is normally named, e.g. sda, sdb or sdc> <the sector(s) NNNN that cannot be read/written are usually mentioned>
    • hostbyte
    • driverbyte
    • DRIVER_SENSE
    • Sense Key : Medium Error
    • Add. Sense: Unrecovered read error - auto reallocate failed
    • ata <ID>: EH complete

    Impact on backup and/or restore activities

    The impact on backup and recovery activities varies depending on which operation fails, how it fails, and whether it fails every time or only occasionally. For example, a bad area or sector on a disk is not always permanently bad: sometimes the hardware can recover or repair lighter errors on its own in the background, and sometimes errors occur only under unfavorable physical conditions such as excessive vibration in the server, computer or datacenter. What is stored in the problematic sectors or areas also matters: parts containing critical LVM or file system metadata, the OS bootloader and kernel, or the system swap partition/file/area are usually more important than others.
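
    To get a rough idea of whether the same spots keep failing or the errors are scattered one-offs, the kernel log can be grouped by device and sector. The following is a minimal sketch, assuming log lines contain the common "I/O error, dev <name>, sector <number>" wording. Not every kernel or driver prints it this way. The function names and the threshold of 2 are arbitrary illustrative choices, and a repeat count in a log is a hint, not a diagnosis.

```python
import re
from collections import Counter

# Assumed line shape: "... I/O error, dev sda, sector 123456 ..."
IO_ERROR_RE = re.compile(r"I/O error, dev (?P<dev>[^\s,]+), sector (?P<sector>\d+)")


def count_failing_sectors(lines):
    """Count how often each (device, sector) pair appears in I/O error lines."""
    counts = Counter()
    for line in lines:
        match = IO_ERROR_RE.search(line)
        if match:
            counts[(match.group("dev"), int(match.group("sector")))] += 1
    return counts


def split_by_recurrence(counts, threshold=2):
    """Separate sectors reported at least `threshold` times from the rest."""
    recurring = {key: n for key, n in counts.items() if n >= threshold}
    one_off = {key: n for key, n in counts.items() if n < threshold}
    return recurring, one_off


if __name__ == "__main__":
    # Hand-written illustrative lines, not captured from a real system.
    sample = [
        "print_req_error: I/O error, dev sda, sector 100",
        "print_req_error: I/O error, dev sda, sector 100",
        "print_req_error: I/O error, dev sdb, sector 5",
    ]
    recurring, one_off = split_by_recurrence(count_failing_sectors(sample))
    print("recurring:", recurring)
    print("one-off:", one_off)
```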

    If the issues are minor and do not affect critical areas, the backup agent's engine may be able to switch into sector-by-sector mode automatically. This behavior can be controlled via the Options sub-menu in the Backup Plan.

    However, in practice, in most cases, the I/O errors are serious enough to make even sector-by-sector backups fail (always or intermittently).

    Backup creation is affected either during the snapshot creation stage, which is handled by the snapapi26 kernel module, or during the actual reading of data from the snapshot in order to send it to the backup.

    Restoring backups to problematic disks usually fails when data in the exact bad spots needs to be overwritten. If critical LVM or file system metadata is corrupted, unreadable or unwritable, a wide variety of errors and messages may appear.

    What to do (reactively AND proactively)

    • Fix or replace the faulty hardware.
    • Repair/resync the hardware or software RAID array (if using one).
    • Periodically run fsck in a mode that checks the entire disk surface (all blocks), or use the "badblocks" Linux utility. Consult the Linux man pages (man fsck, man badblocks) on how to do this.
    • If using hardware or software RAID solutions, configure them to periodically scrub or do patrol reads to detect bad/unstable sectors and disks as early as possible.
    • Use advanced file systems like ZFS and BTRFS, which have native features to detect and (if configured properly) self-heal some of these errors.
    • Take Entire-machine backups frequently enough.
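
    To illustrate what a read-only check of "all blocks" amounts to, the following is a minimal sketch that reads a device or file from start to end and records the ranges where the operating system reports a read error. It is not a replacement for badblocks or for a RAID scrub: it reads through the page cache, does not retry, and the function name and chunk size are hypothetical choices. Reading a raw device such as /dev/sda normally requires root privileges.

```python
import os
import sys


def scan_readable(path, chunk_size=1024 * 1024):
    """Read `path` sequentially; return (bytes_read_ok, bad_ranges).

    bad_ranges is a list of (offset, length) tuples for chunks whose
    read raised OSError (for example EIO from a failing disk).
    """
    bad_ranges = []
    ok_bytes = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        # Seeking to the end gives the size for regular files and,
        # on Linux, for block devices as well.
        size = os.lseek(fd, 0, os.SEEK_END)
        offset = 0
        while offset < size:
            want = min(chunk_size, size - offset)
            try:
                data = os.pread(fd, want, offset)
            except OSError:
                # Record the unreadable chunk and move on to the next one.
                bad_ranges.append((offset, want))
                offset += want
                continue
            if not data:
                break  # unexpected end of data
            ok_bytes += len(data)
            offset += len(data)
    finally:
        os.close(fd)
    return ok_bytes, bad_ranges


if __name__ == "__main__":
    target = sys.argv[1] if len(sys.argv) > 1 else __file__
    ok, bad = scan_readable(target)
    print(f"{target}: {ok} bytes read, {len(bad)} unreadable chunk(s)")
    for start, length in bad:
        print(f"  unreadable: offset {start}, length {length}")
```

    Any unreadable range reported this way should also show up as I/O errors in the kernel logs at the same time, which is where the precise device and sector information is found.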

    Key takeaways

    • It is often important to check the Linux kernel logs (/var/log/messages files, dmesg output) for hardware errors when creating backups fails with snapshot errors, unspecified I/O errors, "cannot read..." errors, and similar. In such cases the kernel logs are much more precise than the fairly high-level (and often generic) error messages that the backup agent reports. The agent is a user-space application after all, and it cannot always "see" or interpret low-level errors of the I/O subsystem.
    • If the hardware cannot read the data (reliably), the Linux kernel/the OS cannot read it either, and the backup agent cannot process it correctly. The backup will therefore keep failing until the hardware problems are fixed, which most often means until the bad disk(s), cable(s), or HBA/RAID card/adapter card are replaced.
    • The backup engine is NOT designed to back up barely functioning/marginal storage hardware; it is NOT a specialized data recovery/disk repair software tool. Specialized data recovery tools can sometimes extract data from very unstable sectors, using specialized techniques like retrying a non-responding sector tens or hundreds of times.
