Bug #14022

zpool online -e breaks access to pool

Added by Andy Fiddaman 9 months ago. Updated 3 months ago.

Status: Closed
Priority: Normal
Assignee: -
Category: -
Start date:
Due date:
% Done: 100%
Estimated time:
Difficulty: Medium
Tags:
Gerrit CR:

Description

Attempting to expand a ZFS pool in one operation using zpool online -e does not work, causing pool access (and the machine) to hang.
If the partition is expanded first using format, however, then it succeeds.

root@omnios:~# zpool status
  pool: rpool
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          c1t0d0    ONLINE       0     0     0

errors: No known data errors

root@omnios:~# zpool get expandsz rpool
NAME   PROPERTY    VALUE     SOURCE
rpool  expandsize  2G        -

root@omnios:~# prtvtoc /dev/dsk/c1t0d0
* /dev/dsk/c1t0d0 EFI partition map
*
* Dimensions:
*         512 bytes/sector
*    20971520 sectors
*    16777149 accessible sectors
*
* Flags:
*   1: unmountable
*  10: read-only
*
* Unallocated space:
*         First       Sector      Last
*         Sector       Count      Sector
*            34         222         255
*
*                            First       Sector      Last
* Partition  Tag  Flags      Sector       Count      Sector  Mount Directory
       0     12    00          256      524288      524543
       1      4    00       524544    16236255    16760798
       8     11    00     16760799       16384    16777182

root@omnios:~# zpool online -e rpool c1t0d0

[0]> ::stacks -c vdev_online
THREAD           STATE    SOBJ                COUNT
fffffe064f725400 SLEEP    CV                      1
                 swtch+0x133
                 cv_wait+0x68
                 zfs`txg_wait_synced_impl+0xa5
                 zfs`txg_wait_synced+0xd
                 zfs`spa_vdev_state_exit+0xb9
                 zfs`vdev_online+0x23a
                 zfs`zfs_ioc_vdev_set_state+0x7b
                 zfs`zfsdev_ioctl+0x1fd
                 cdev_ioctl+0x2b
                 specfs`spec_ioctl+0x45
                 fop_ioctl+0x5b
                 ioctl+0x153

[0]> ::msgbuf
...
WARNING: Pool 'rpool' has encountered an uncorrectable I/O failure and has been suspended; `zpool clear` will be required before the pool can be written to.
[0]> ::zfs_dbgmsg
disk vdev '/dev/dsk/c1t0d0s1': vdev_disk_open: failed to get size
spa=rpool async request task=1
rpool: metaslab allocation failure: zio fffffe065a5b49f8, size 512, error 28
rpool: metaslab allocation failure: zio fffffe065a5d3dd0, size 4096, error 28
rpool: metaslab allocation failure: zio fffffe065a5c2228, size 4096, error 28
rpool: metaslab allocation failure: zio fffffe065a5c3508, size 4096, error 28
rpool: metaslab allocation failure: zio fffffe065a586cb0, size 49152, error 28
rpool: metaslab allocation failure: zio fffffe065a5d1200, size 4096, error 28
rpool: zil block allocation failure: size 36864, error 28
rpool: metaslab allocation failure: zio fffffe065a584930, size 4096, error 28

[0]> ::spa
ADDR                 STATE NAME
fffffe064a8eb000    ACTIVE rpool
[0]> fffffe064a8eb000::spa_vdevs
ADDR             STATE     AUX          DESCRIPTION
fffffe064b8ee000 CANT_OPEN NO_REPLICAS  root
fffffe064b8eb000 CANT_OPEN OPEN_FAILED    /dev/dsk/c1t0d0s1

This is a bhyve VM, so, from the host:

carolina# mdb /dev/zvol/rdsk/rpool/test/root
> ::load disk_label
> ::gpt
Signature: EFI PART (valid)
Revision: 1.0
HeaderSize: 92 bytes
HeaderCRC32: 0xfcd5dc82 (should be 0xfcd5dc82)
Reserved1: 0 (should be 0x0)
MyLBA: 1 (should be 1)
AlternateLBA: 20971519
FirstUsableLBA: 34
LastUsableLBA: 20971486
DiskGUID: 5897e0f9-b01a-4cd7-d30f-95c37f715c2f
PartitionEntryLBA: 2
NumberOfPartitionEntries: 9
SizeOfPartitionEntry: 0x80 bytes
PartitionEntryArrayCRC32: 0x59d5ed6a (should be 0x59d5ed6a)

PART TYPE                STARTLBA      ENDLBA        ATTR     NAME
0    EFI_SYSTEM          256           524543        0        loader
1    EFI_USR             524544        20955102      0        zfs
2    EFI_UNUSED
3    EFI_UNUSED
4    EFI_UNUSED
5    EFI_UNUSED
6    EFI_UNUSED
7    EFI_UNUSED
8    EFI_RESERVED        20955103      20971486      0

Related issues

Related to illumos gate - Feature #14469: nvme could raise dynamic lun expansion sysevents (Closed) - Andy Fiddaman
Related to illumos gate - Feature #14471: vioblk could raise dynamic lun expansion sysevents (Closed) - Andy Fiddaman
Related to illumos gate - Bug #14593: zpool online -e hangs forever after 14022 (Closed)
Related to illumos gate - Feature #14600: dkio(4I) needs an ioctl to re-read the disk label (New)

Actions #1

Updated by Toomas Soome 9 months ago

Andy Fiddaman wrote:

Attempting to expand a ZFS pool in one operation using zpool online -e does not work, causing pool access (and the machine) to hang.
If the partition is expanded first using format, however, then it succeeds.

EFI wholedisk or custom partitioning?

Actions #2

Updated by Andy Fiddaman 9 months ago

  • Description updated (diff)
Actions #3

Updated by Andy Fiddaman 9 months ago

  • Description updated (diff)
Actions #4

Updated by Andy Fiddaman 9 months ago

Toomas Soome wrote in #note-1:

EFI wholedisk or custom partitioning?

I've expanded the description - the reproducer I'm using is EFI whole disk, so that zpool online -e tries to expand it.

Actions #5

Updated by Andy Fiddaman 9 months ago

The pool ends up suspended because ldi_get_size(), called from vdev_disk_open() in the re-open case, returns -1:

dtrace \
    -n 'fbt::ldi_get_size:entry{v = (struct ldi_handle *)arg0; print(*v); self->t++}' \
    -n 'fbt::ldi*:entry/self->t/{}' \
    -n 'fbt::ldi*:return/self->t/{trace(arg1)}' \
    -F
CPU FUNCTION
  2  -> ldi_get_size                          struct ldi_handle {
    struct ldi_handle *lh_next = 0
    uint_t lh_ref = 0x1
    uint_t lh_flags = 0
    uint_t lh_type = 0x2
    struct ldi_ident *lh_ident = 0xfffffe064a4abbf8
    vnode_t *lh_vp = 0xfffffe064a4c3a40
    kmutex_t [1] lh_lock = [
        kmutex_t {
            void *[1] _opaque = [ 0 ]
        }
    ]
    struct ldi_event *lh_events = 0
}
  2   | ldi_get_size:entry
  2    -> ldi_get_otyp
  2    <- ldi_get_otyp                                        0
  2    -> ldi_prop_exists
  2    <- ldi_prop_exists                                     0
  2    -> ldi_prop_exists
  2    <- ldi_prop_exists                                     0
  2    -> ldi_prop_exists
  2    <- ldi_prop_exists                                     0
  2    -> ldi_prop_exists
  2    <- ldi_prop_exists                                     0
  2  <- ldi_get_size                                 4294967295
Actions #6

Updated by Andy Fiddaman 3 months ago

What is happening here is that during an online expansion of an EFI disk, libzfs:zpool_vdev_online() calls libzfs:zpool_relabel_disk(), which in turn calls libefi:efi_use_whole_disk(). efi_use_whole_disk() uses the DKIOCSETEFI ioctl to write the new GPT table header, which ends up calling cmlb_dkio_set_efi(). After writing the table, this function sets cl_f_geometry_is_valid = B_FALSE; on the LUN.

When libzfs then continues and tries to re-open the device, there is a race where it can attempt to read the device size while the geometry is still marked as invalid. Adding a small pause here makes the resize work most of the time.

  0    -> vdev_reopen
  0      -> vdev_close
  0      <- vdev_close                                        1
  0      -> vdev_open
  0        -> vdev_disk_open
  0          -> ldi_get_size
  0            -> ldi_prop_exists
  0              -> cmlb_prop_op
  0                -> cmlb_is_valid
  0                <- cmlb_is_valid                           0           <- due to cl_f_geometry_is_valid being false
  0              <- cmlb_prop_op                              1
  0            <- ldi_prop_exists                             0
  0          <- ldi_get_size                         4294967295
  0           | zfs_dbgmsg:zfs-dbgmsg         disk vdev '/dev/dsk/c1t0d0s1': vdev_disk_open: failed to get size
  0        <- vdev_disk_open                                 22
  0        -> vdev_set_state
  0          -> vdev_is_dead
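
To make this concrete, here is a small standalone C model of the race (illustrative only, not the illumos code): geometry_valid stands in for cmlb's cl_f_geometry_is_valid flag, and the mock_* functions stand in for cmlb_dkio_set_efi(), a later cmlb_validate_geometry() pass and the ldi_get_size() call in the re-open path.

/*
 * Standalone model of the race, not the illumos source.
 */
#include <errno.h>
#include <stdint.h>
#include <stdio.h>

static int geometry_valid = 1;

/* Model of cmlb_dkio_set_efi(): write the new label, invalidate geometry. */
static void
mock_set_efi(void)
{
        geometry_valid = 0;
}

/* Model of a later cmlb_validate_geometry() pass. */
static void
mock_validate_geometry(void)
{
        geometry_valid = 1;
}

/* Model of ldi_get_size(): fails while the cached geometry is invalid. */
static int
mock_get_size(uint64_t *sizep)
{
        if (!geometry_valid)
                return (-1);
        *sizep = 20971520ULL * 512;     /* disk size from the GPT above */
        return (0);
}

/* Model of vdev_disk_open() in the re-open path. */
static int
mock_reopen(void)
{
        uint64_t size;

        if (mock_get_size(&size) != 0) {
                (void) fprintf(stderr, "vdev_disk_open: failed to get size\n");
                return (EINVAL);
        }
        (void) printf("reopened, size %llu bytes\n", (unsigned long long)size);
        return (0);
}

int
main(void)
{
        mock_set_efi();                 /* zpool online -e relabels the disk */
        (void) mock_reopen();           /* the immediate re-open loses the race */

        mock_validate_geometry();       /* after a pause / revalidation ... */
        return (mock_reopen());         /* ... the same re-open succeeds */
}

Run as-is, the first re-open fails with EINVAL (the 22 seen returned from vdev_disk_open above) and the second succeeds once the geometry is marked valid again.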
Actions #7

Updated by Joshua M. Clulow 3 months ago

Perhaps rather than a small pause we could adjust things so that we can detect this condition explicitly and retry until it resolves?

Actions #8

Updated by Electric Monk 3 months ago

  • Gerrit CR set to 2004
Actions #9

Updated by Andy Fiddaman 3 months ago

The small pause was just a diagnostic (a dtrace chill()) to try to isolate the problem. I have a proposed fix that is at least in the right area.

Actions #10

Updated by Andy Fiddaman 3 months ago

I've tested the patch in the associated gerrit review on an OmniOS VM running under bhyve. Prior to this, the pool faulted 100% of the time across 20 attempts; afterwards I had 20 consecutive successful runs. Here is some output from one of them:

root@test:~# zpool list
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
rpool  7.50G  2.15G  5.35G        -        3G     2%    28%  1.00x  ONLINE  -
root@test:~# zpool status
  pool: rpool
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          c1t0d0    ONLINE       0     0     0

errors: No known data errors
root@test:~# prtvtoc -s /dev/rdsk/c1t0d0
*                            First       Sector      Last
* Partition  Tag  Flags      Sector       Count      Sector  Mount Directory
       0     12    00          256      524288      524543
       1      4    00       524544    16236255    16760798
       8     11    00     16760799       16384    16777182

root@test:~# zpool online -e rpool c1t0d0

root@test:~# zpool list
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
rpool  10.5G  2.15G  8.35G        -         -     1%    20%  1.00x  ONLINE  -

root@test:~# prtvtoc -s /dev/rdsk/c1t0d0
*                            First       Sector      Last
* Partition  Tag  Flags      Sector       Count      Sector  Mount Directory
       0     12    00          256      524288      524543
       1      4    00       524544    22527711    23052254
       8     11    00     23052255       16384    23068638

Using dtrace, I can see that the geometry is being revalidated as a result of the new ioctl() call:

   0  47448          cmlb_dkio_set_efi:entry
  0  47522                 cmlb_ioctl:entry   DKIOCSTATE
  0  47418     cmlb_validate_geometry:entry
              cmlb`cmlb_ioctl+0x9d
              blkdev`bd_ioctl+0xc5
              genunix`cdev_ioctl+0x3f
              specfs`spec_ioctl+0x55
              genunix`fop_ioctl+0x40
              genunix`ioctl+0x144
              unix`sys_syscall+0x2ae

  0   1572            zfs_dbgmsg:zfs-dbgmsg spa=rpool async request task=1

This is the script I used here:

#!/usr/sbin/dtrace -Cs

/* Print kernel zfs debug messages as they are emitted. */
zfs-dbgmsg{printf("%s\n", stringof(arg0))}

/* 0x40d is DKIOCSTATE; mark the thread while cmlb_ioctl() handles it. */
fbt::cmlb_ioctl:entry/arg2 == 0x40d/ { self->t++; trace("DKIOCSTATE") }
fbt::cmlb_ioctl:return/self->t/ { self->t--; }

/* Show what triggers geometry revalidation within that ioctl. */
fbt::cmlb_validate_geometry:entry/self->t/ { stack() }

/* Record each write of a new EFI label. */
fbt::cmlb_dkio_set_efi:entry {}
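
For reference, the revalidation can also be exercised from a standalone userland program: per the stack above, a DKIOCSTATE ioctl on the raw device goes through cmlb_ioctl() and from there into cmlb_validate_geometry(). The sketch below only illustrates that ioctl; it is not the change in the gerrit review, and the default device path is just an example.

/*
 * Issue DKIOCSTATE against a raw disk device; as the stack trace above
 * shows, cmlb handles this ioctl and validates the geometry on the way.
 */
#include <sys/types.h>
#include <sys/dkio.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
        const char *dev = (argc > 1) ? argv[1] : "/dev/rdsk/c1t0d0s1";
        enum dkio_state state = DKIO_NONE;
        int fd;

        if ((fd = open(dev, O_RDONLY | O_NDELAY)) < 0) {
                perror(dev);
                return (1);
        }

        /*
         * With DKIO_NONE as the "last seen" state, the ioctl should return
         * as soon as the driver reports its current state.
         */
        if (ioctl(fd, DKIOCSTATE, &state) < 0)
                perror("DKIOCSTATE");
        else
                (void) printf("%s: dkio_state = %d\n", dev, (int)state);

        (void) close(fd);
        return (0);
}

Build with cc -o dkstate dkstate.c and run it as root against the pool's device slice.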
Actions #11

Updated by Andy Fiddaman 3 months ago

  • Related to Feature #14469: nvme could raise dynamic lun expansion sysevents added
Actions #12

Updated by Andy Fiddaman 3 months ago

For completeness, I also checked that there is no regression around legacy pools on MBR-partitioned drives:

omnios@bhyvetest:~$ zpool status
  pool: rpool
 state: ONLINE
  scan: none requested
config:

        NAME                       STATE     READ WRITE CKSUM
        rpool                      ONLINE       0     0     0
          c1t589CFC2045C20001d0s0  ONLINE       0     0     0

errors: No known data errors
omnios@bhyvetest:~$ pfexec fdisk -W - c1t589CFC2045C20001d0p0 | tail -5
* Id    Act  Bhead  Bsect  Bcyl    Ehead  Esect  Ecyl    Rsect      Numsect
  191   128  65     2      0       255    63     1023    4096       16773120
  0     0    0      0      0       0      0      0       0          0
  0     0    0      0      0       0      0      0       0          0
  0     0    0      0      0       0      0      0       0          0

omnios@bhyvetest:~$ zpool list
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
rpool  7.50G   767M  6.75G        -         -     0%     9%  1.00x  ONLINE  -

omnios@bhyvetest:~$ pfexec /usr/lib/sysevent/sysevent -r
nvlist version: 0
        date = 2022-02-18T09:54:22.558Z
        type = ready

... at this point, I resized the underlying ZVOL from outside the VM and checked that the sysevent was received.

nvlist version: 0
        date = 2022-02-18T09:54:52.108Z
        type = event
        vendor = SUNW
        publisher = nvme
        class = EC_dev_status
        subclass = ESC_dev_dle
        pid = 0
        data = (embedded nvlist)
        nvlist version: 0
                phys_path = /devices/pci@0,0/pcifb5d,a0a@4/blkdev@w589CFC2045C20001,0:a
        (end data)

... and, as expected, no change to the pool:

root@bhyvetest:~# zpool list
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
rpool  7.50G   767M  6.75G        -         -     0%     9%  1.00x  ONLINE  -

root@bhyvetest:~# zpool online -e rpool c1t589CFC2045C20001d0s0

root@bhyvetest:~# zpool list
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
rpool  7.50G   767M  6.75G        -         -     0%     9%  1.00x  ONLINE  -
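
As an aside, these events can also be consumed programmatically rather than with /usr/lib/sysevent/sysevent. The following is an untested libsysevent(3LIB) sketch (run as root, link with -lsysevent -lnvpair); the EC_dev_status / ESC_dev_dle class and subclass strings and the phys_path attribute are the ones visible in the output above.

/*
 * Minimal sysevent consumer for dynamic lun expansion notifications.
 */
#include <libsysevent.h>
#include <libnvpair.h>
#include <stdio.h>
#include <unistd.h>

static void
handler(sysevent_t *ev)
{
        nvlist_t *attrs = NULL;
        char *path = NULL;

        (void) printf("%s/%s from %s\n",
            sysevent_get_class_name(ev),
            sysevent_get_subclass_name(ev),
            sysevent_get_pub_name(ev));

        if (sysevent_get_attr_list(ev, &attrs) == 0) {
                if (nvlist_lookup_string(attrs, "phys_path", &path) == 0)
                        (void) printf("  phys_path = %s\n", path);
                nvlist_free(attrs);
        }
}

int
main(void)
{
        const char *subclasses[] = { "ESC_dev_dle" };
        sysevent_handle_t *sh;

        if ((sh = sysevent_bind_handle(handler)) == NULL) {
                perror("sysevent_bind_handle");
                return (1);
        }
        if (sysevent_subscribe_event(sh, "EC_dev_status", subclasses, 1) != 0) {
                perror("sysevent_subscribe_event");
                sysevent_unbind_handle(sh);
                return (1);
        }
        for (;;)
                (void) pause();         /* events arrive via handler() */
}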
Actions #13

Updated by Andy Fiddaman 3 months ago

and here is a test with a pool on a GPT-partitioned disk, first without a UEFI boot partition:

root@bhyvetest:~# zpool create bob c2t0d0
root@bhyvetest:~# prtvtoc -s /dev/rdsk/c2t0d0p0
*                            First       Sector      Last
* Partition  Tag  Flags      Sector       Count      Sector  Mount Directory
       0      4    00          256    20954847    20955102
       8     11    00     20955103       16384    20971486
root@bhyvetest:~# zpool list bob
NAME   SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
bob   9.50G  1.12M  9.50G        -         -     0%     0%  1.00x  ONLINE  -
root@bhyvetest:~# zdb | grep phys_path
            phys_path: '/pci@0,0/pci1af4,2@5/blkdev@0,0:a'          <-- note 'a'

... volume resize, outside the VM

root@bhyvetest:~# zpool list bob
NAME   SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
bob   9.50G  1.12M  9.50G        -       11G     0%     0%  1.00x  ONLINE  -
root@bhyvetest:~# zpool online -e bob c2t0d0
root@bhyvetest:~# zpool list bob
NAME   SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
bob   20.5G  1.38M  20.5G        -         -     0%     0%  1.00x  ONLINE  -

and with a UEFI boot partition:

root@bhyvetest:~# zpool create -B bob c2t0d0
root@bhyvetest:~# prtvtoc -s /dev/rdsk/c2t0d0p0
*                            First       Sector      Last
* Partition  Tag  Flags      Sector       Count      Sector  Mount Directory
       0     12    00          256      524288      524543
       1      4    00       524544    43499231    44023774
       8     11    00     44023775       16384    44040158
root@bhyvetest:~# zpool list bob
NAME   SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
bob   20.5G   744K  20.5G        -        1G     0%     0%  1.00x  ONLINE  -
root@bhyvetest:~# zdb | grep phys_path
            phys_path: '/pci@0,0/pci1af4,2@5/blkdev@0,0:b'          <-- note 'b'

... volume resize, outside the VM

root@bhyvetest:~# zpool online -e bob c2t0d0
root@bhyvetest:~# zpool list bob
NAME   SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
bob   21.5G  1.10M  21.5G        -         -     0%     0%  1.00x  ONLINE  -

Actions #14

Updated by Andy Fiddaman 3 months ago

  • Related to Feature #14471: vioblk could raise dynamic lun expansion sysevents added
Actions #15

Updated by Electric Monk 3 months ago

  • Status changed from New to Closed
  • % Done changed from 0 to 100

git commit c2cd3a449cfa117e3a164f66931fa6c26c762945

commit  c2cd3a449cfa117e3a164f66931fa6c26c762945
Author: Andy Fiddaman <omnios@citrus-it.co.uk>
Date:   2022-02-18T18:29:52.000Z

    14022 zpool online -e breaks access to pool
    Reviewed by: C Fraire <cfraire@me.com>
    Reviewed by: Jason King <jason.brian.king+illumos@gmail.com>
    Approved by: Dan McDonald <danmcd@joyent.com>

Actions #16

Updated by Joshua M. Clulow about 2 months ago

  • Related to Bug #14593: zpool online -e hangs forever after 14022 added
Actions #17

Updated by Joshua M. Clulow about 2 months ago

  • Related to Feature #14600: dkio(4I) needs an ioctl to re-read the disk label added