Bug #14115

dumping to zvol on raidz will corrupt the pool

Added by Joshua M. Clulow 4 months ago. Updated 4 months ago.

Status: Closed
Priority: Immediate
Assignee: -
Category: zfs - Zettabyte File System
Start date:
Due date:
% Done: 100%
Estimated time:
Difficulty: Medium
Tags:
Gerrit CR:

Description

It would appear that dumping to a zvol in a pool that uses raidz can induce corruption in data files within the pool.

root@unknown:~# zfs list fire/dump
NAME        USED  AVAIL  REFER  MOUNTPOINT
fire/dump  2.00G   101M  2.00G  -

root@unknown:~# dumpadm
      Dump content: kernel pages
       Dump device: /dev/zvol/dsk/fire/dump (dedicated)
Savecore directory: /var/crash/unknown
  Savecore enabled: yes
   Save compressed: on

root@unknown:~# zpool status -v fire
  pool: fire
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 7.94M in 0 days 00:00:12 with 1 errors on Sat Sep 25 17:27:07 2021
config:

        NAME        STATE     READ WRITE CKSUM
        fire        DEGRADED     0     0     1
          raidz2-0  DEGRADED     0     0     2
            c4t0d0  DEGRADED     0     0    30  too many errors
            c6t0d0  DEGRADED     0     0    32  too many errors
            c7t0d0  DEGRADED     0     0    35  too many errors
            c8t0d0  DEGRADED     0     0    30  too many errors

errors: Permanent errors have been detected in the following files:

        /fire/trash/44

This appears to happen regardless of the ashift (sector size) value for the pool. I have reproduced it with at least ashift 9, 12, 13, and 14. It also appears to affect all raidz parity levels; i.e., raidz1, raidz2, and raidz3. It does not appear to occur when dumping to a zvol in a striped pool with no redundancy.


Related issues

Related to illumos gate - Bug #12085: zvol_dumpio would be cleaner with vdev_op_dumpio (Closed; assignee: Jason King)

Actions #1

Updated by Joshua M. Clulow 4 months ago

To reproduce, create a virtual machine with a regular root disk and four small disks (4GB is sufficient):

root@unknown:~# diskinfo
TYPE    DISK                    VID      PID              SIZE          RMV SSD
-       c5t0d0                  Virtio   Block Device       10.00 GiB   no  no
-       c4t0d0                  Virtio   Block Device        4.00 GiB   no  no
-       c6t0d0                  Virtio   Block Device        4.00 GiB   no  no
-       c7t0d0                  Virtio   Block Device        4.00 GiB   no  no
-       c8t0d0                  Virtio   Block Device        4.00 GiB   no  no

Create a pool and attach it as a dump device:

#!/bin/bash

set -o errexit
set -o pipefail
set -o xtrace

dumpadm -d none || true
zpool destroy fire || true
zpool create -o ashift=12 fire raidz2 c4t0d0 c6t0d0 c7t0d0 c8t0d0
zfs create fire/trash
zfs create -V 2G fire/dump
dumpadm -d /dev/zvol/dsk/fire/dump

Fill the pool with random junk:

#!/bin/bash

set -o errexit
set -o pipefail

while :; do
        avail=$(zfs list -Hp -o avail fire/trash)
        if (( $avail < (200 * 1024 * 1024) )); then
                break
        fi

        for (( j = 0; j < 10000; j++ )); do
                x=/fire/trash/$j
                if [[ -f $x ]]; then
                        continue;
                fi
                break;
        done

        echo $x
        dd if=/dev/urandom of=$x bs=1M count=100

        zfs list fire
done

zfs list fire

Scrub the pool to confirm there is not yet any corruption.
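
For example, something along these lines (pool name as in the scripts above):

zpool scrub fire
zpool status -v fire        # repeat until the scrub completes; expect 0 errors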

Then initiate a panic; e.g., with uadmin 5 1.

After reboot, scrub the pool and watch the pool status for corruption.

More details on the specific locations of the corruption are available in the FMA error log; e.g., visible via fmdump -eVp, looking for ereport.fs.zfs.checksum records.
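
For example, something like this (the output file name is arbitrary):

fmdump -eVp > /var/tmp/ereports.txt
grep -c ereport.fs.zfs.checksum /var/tmp/ereports.txt    # count of checksum ereports
less /var/tmp/ereports.txt                               # inspect vdev and offset details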

In the first pool where I experienced this issue, which holds real data, the corruption was in years-old files that had routinely scrubbed cleanly until a panic and dump event this week. Though never conclusive, the SMART data suggests the drives are in good health.

Actions #2

Updated by Joshua M. Clulow 4 months ago

  • Related to Bug #12085: zvol_dumpio would be cleaner with vdev_op_dumpio added
Actions #3

Updated by Joshua M. Clulow 4 months ago

I'm still trying to understand the layering here, but it seems like #12085 replaced the old physio routines with dumpio routines. As a raidz vdev necessarily sits on top of at least one additional child vdev layer (i.e., disk vdevs), it seems that vdev_raidz_dumpio() will call through to vdev_disk_dumpio() to perform the actual disk write.

The old vdev_disk_physio() does not appear to have accounted for the vdev label by adjusting the offset of the I/O. The new vdev_disk_dumpio() unconditionally does:

static int
vdev_disk_dumpio(vdev_t *vd, caddr_t data, size_t size,
    uint64_t offset, uint64_t origoffset __unused, boolean_t doread,
    boolean_t isdump)
{
...
        offset += VDEV_LABEL_START_SIZE;

This calculation would appear to be redundant with the offset adjustment that remains in vdev_raidz_dumpio():

static int
vdev_raidz_dumpio(vdev_t *vd, caddr_t data, size_t size,
    uint64_t offset, uint64_t origoffset, boolean_t doread, boolean_t isdump)
{
...
                /*
                 * Note that the child vdev will have a vdev label at the start
                 * of its range of offsets, hence the need for
                 * VDEV_LABEL_OFFSET().  See zio_vdev_child_io() for another
                 * example of why this calculation is needed.
                 */
                if ((err = cvd->vdev_ops->vdev_op_dumpio(cvd,
                    ((char *)abd_to_buf(rc->rc_abd)) + colskip, colsize,
                    VDEV_LABEL_OFFSET(rc->rc_offset) + colskip, 0,
                    doread, isdump)) != 0)
                        break;

Looking in zio_vdev_child_io(), as mentioned in the comment above, we see that the label offset adjustment there is predicated on the current vdev itself being a leaf vdev (which raidz is not):

        if (vd->vdev_ops->vdev_op_leaf) {
                ASSERT0(vd->vdev_children);
                offset += VDEV_LABEL_START_SIZE;
        }

I think we are, at this point, adding the label offset twice. I have confirmed via zdb that the zvol_extent_t list for the dump device exactly matches the list of L0 blocks visible in the pool for the dump zvol.

Indeed, VDEV_LABEL_START_SIZE (two copies of the vdev label plus the 3.5 MiB boot block region) would seem to be 0x400000, i.e., 4 MiB:

> ::sizeof vdev_label_t
sizeof (vdev_label_t) = 0x40000
> (2*40000) + (0t7<<0t19) =J
                400000

Using od on a file that appears in zpool status -v, I found the offset at which the corruption starts. Then I used zdb -vvvvv -O on the file to locate its blocks on disk:

         2f40000  L0 0:257f6a000:42000 20000L/20000P F=1 B=50/50
         2f60000  L0 0:257fac000:42000 20000L/20000P F=1 B=50/50
         2f80000  L0 0:257fee000:42000 20000L/20000P F=1 B=50/50
--- corruption starts here
         2fa0000  L0 0:d0023000:42000 20000L/20000P F=1 B=50/50
         2fc0000  L0 0:d0065000:42000 20000L/20000P F=1 B=50/50
         2fe0000  L0 0:d00a7000:42000 20000L/20000P F=1 B=50/50
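
The commands for that step were roughly as follows, assuming the affected file is /fire/trash/44 as in the zpool status output above (the exact zdb verbosity is not important):

od -A x -t x1 /fire/trash/44 | less     # look for the first line that is not all 0x41
zdb -vvvvv -O fire/trash 44             # print the block pointers for the file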

The file in question should contain nothing but A characters, and I was able to see that the (uncompressed) disk blocks indeed do, right up until what is apparently dump data shows up at offset 0x5000 in the block:

# zdb -R fire 0:d0023000:42000
0:d0023000:42000
          0 1 2 3 4 5 6 7   8 9 a b c d e f  0123456789abcdef
000000:  4141414141414141  4141414141414141  AAAAAAAAAAAAAAAA
000010:  4141414141414141  4141414141414141  AAAAAAAAAAAAAAAA
000020:  4141414141414141  4141414141414141  AAAAAAAAAAAAAAAA
000030:  4141414141414141  4141414141414141  AAAAAAAAAAAAAAAA
000040:  4141414141414141  4141414141414141  AAAAAAAAAAAAAAAA
000050:  4141414141414141  4141414141414141  AAAAAAAAAAAAAAAA
000060:  4141414141414141  4141414141414141  AAAAAAAAAAAAAAAA
...
004fd0:  4141414141414141  4141414141414141  AAAAAAAAAAAAAAAA
004fe0:  4141414141414141  4141414141414141  AAAAAAAAAAAAAAAA
004ff0:  4141414141414141  4141414141414141  AAAAAAAAAAAAAAAA
005000:  e500fe70fe6f7473  e400fe84fd3c1cf6  sto.p.....<.....
005010:  2326002e8900fe84  f8f02b54de55002e  ......&#..U.T+..
005020:  fe201410285518a8  429cdf0afc941d00  ..U(.. ........B
005030:  00163cfe3cfc7ce5  1014c30016f31011  .|.<.<..........
...

Taking that block's offset (0xd0023000) plus 0x5000, we get 0xd0028000. Subtracting the label offset gives 0xcfc28000. That address does appear within one of the L0 blocks for the dump device:

Dataset fire/dump [ZVOL], ID 96, cr_txg 8, 2.00G, 2 objects, rootbp DVA[0]=<0:a0030000:3000> DVA[1]=<0:e0021000:3000> [L0 DMU objset] fletcher4 uncompressed unencrypted LE contiguous unique double size=1000L/1000P birth=3147L/3147P fill=2 cksum=281312c74:9a159f6035d:1288a1a6b45667:17cdee8af55b3e11
...
         7a00000   L0 0:cfc24000:42000 20000L/20000P F=1 B=17/17
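
As a sanity check of the arithmetic above, a rough bash sketch (all constants are taken from the zdb and mdb output in this comment):

#!/bin/bash

# VDEV_LABEL_START_SIZE: two vdev labels plus the 3.5 MiB boot region = 0x400000
label_start=$(( 2 * 0x40000 + (7 << 19) ))

corrupt_block=0xd0023000   # L0 block of /fire/trash/44 where the corruption begins
corrupt_off=0x5000         # offset of the first non-'A' bytes within that block
dump_block=0xcfc24000      # L0 block of the fire/dump zvol
dump_len=0x42000

addr=$(( corrupt_block + corrupt_off ))
adjusted=$(( addr - label_start ))

printf 'corruption at       0x%x\n' "$addr"
printf 'minus label offset  0x%x\n' "$adjusted"

# confirm the adjusted address falls inside the dump zvol's L0 block
if (( adjusted >= dump_block && adjusted < dump_block + dump_len )); then
        echo 'adjusted address lies within the dump zvol block'
fi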
Actions #4

Updated by Joshua M. Clulow 4 months ago

  • Gerrit CR set to 1731
Actions #5

Updated by Joshua M. Clulow 4 months ago

Testing Notes

After booting the change I have put up for review, I was able to recover the dump correctly with savecore, and I was then able to scrub the pool without any checksum errors.
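
Roughly, the verification amounted to something like:

savecore -v                 # uses the savecore directory configured via dumpadm
zpool scrub fire
zpool status -v fire        # confirm the scrub completes with no errors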

Actions #6

Updated by Electric Monk 4 months ago

  • Status changed from New to Closed
  • % Done changed from 0 to 100

git commit dcbb6cd173ed2c0ad9b52235b9251d2892bb219a

commit  dcbb6cd173ed2c0ad9b52235b9251d2892bb219a
Author: Joshua M. Clulow <josh@sysmgr.org>
Date:   2021-09-27T19:29:04.000Z

    14115 dumping to zvol on raidz will corrupt the pool
    Reviewed by: Andy Fiddaman <andy@omnios.org>
    Approved by: Dan McDonald <danmcd@joyent.com>
