Bug #14022: zpool online -e breaks access to pool
Status: Closed
% Done: 100%
Description
Attempting to expand a ZFS pool in one operation using zpool online -e does not work, causing pool access (and the machine) to hang. If the partition is expanded first using format, however, then the expansion succeeds.
root@omnios:~# zpool status
  pool: rpool
 state: ONLINE
  scan: none requested
config:

        NAME      STATE     READ WRITE CKSUM
        rpool     ONLINE       0     0     0
          c1t0d0  ONLINE       0     0     0

errors: No known data errors
root@omnios:~# zpool get expandsz rpool
NAME   PROPERTY    VALUE  SOURCE
rpool  expandsize  2G     -
root@omnios:~# prtvtoc /dev/dsk/c1t0d0
* /dev/dsk/c1t0d0 EFI partition map
*
* Dimensions:
*       512 bytes/sector
*  20971520 sectors
*  16777149 accessible sectors
*
* Flags:
*   1: unmountable
*  10: read-only
*
* Unallocated space:
*       First      Sector     Last
*       Sector     Count      Sector
*       34         222        255
*
*                           First      Sector     Last
* Partition  Tag  Flags     Sector     Count      Sector  Mount Directory
       0     12    00          256    524288     524543
       1      4    00       524544  16236255   16760798
       8     11    00     16760799     16384   16777182
root@omnios:~# zpool online -e rpool c1t0d0
[0]> ::stacks -c vdev_online
THREAD           STATE    SOBJ                COUNT
fffffe064f725400 SLEEP    CV                      1
                 swtch+0x133
                 cv_wait+0x68
                 zfs`txg_wait_synced_impl+0xa5
                 zfs`txg_wait_synced+0xd
                 zfs`spa_vdev_state_exit+0xb9
                 zfs`vdev_online+0x23a
                 zfs`zfs_ioc_vdev_set_state+0x7b
                 zfs`zfsdev_ioctl+0x1fd
                 cdev_ioctl+0x2b
                 specfs`spec_ioctl+0x45
                 fop_ioctl+0x5b
                 ioctl+0x153

[0]> ::msgbuf
...
WARNING: Pool 'rpool' has encountered an uncorrectable I/O failure and
has been suspended; `zpool clear` will be required before the pool can
be written to.
[0]> ::zfs_dbgmsg
disk vdev '/dev/dsk/c1t0d0s1': vdev_disk_open: failed to get size
spa=rpool async request task=1
rpool: metaslab allocation failure: zio fffffe065a5b49f8, size 512, error 28
rpool: metaslab allocation failure: zio fffffe065a5d3dd0, size 4096, error 28
rpool: metaslab allocation failure: zio fffffe065a5c2228, size 4096, error 28
rpool: metaslab allocation failure: zio fffffe065a5c3508, size 4096, error 28
rpool: metaslab allocation failure: zio fffffe065a586cb0, size 49152, error 28
rpool: metaslab allocation failure: zio fffffe065a5d1200, size 4096, error 28
rpool: zil block allocation failure: size 36864, error 28
rpool: metaslab allocation failure: zio fffffe065a584930, size 4096, error 28
[0]> ::spa
ADDR             STATE  NAME
fffffe064a8eb000 ACTIVE rpool

[0]> fffffe064a8eb000::spa_vdevs
ADDR             STATE     AUX          DESCRIPTION
fffffe064b8ee000 CANT_OPEN NO_REPLICAS  root
fffffe064b8eb000 CANT_OPEN OPEN_FAILED    /dev/dsk/c1t0d0s1
This is a bhyve VM, so the backing device's label can be inspected from the host:
carolina# mdb /dev/zvol/rdsk/rpool/test/root
> ::load disk_label
> ::gpt
Signature: EFI PART (valid)
Revision: 1.0
HeaderSize: 92 bytes
HeaderCRC32: 0xfcd5dc82 (should be 0xfcd5dc82)
Reserved1: 0 (should be 0x0)
MyLBA: 1 (should be 1)
AlternateLBA: 20971519
FirstUsableLBA: 34
LastUsableLBA: 20971486
DiskGUID: 5897e0f9-b01a-4cd7-d30f-95c37f715c2f
PartitionEntryLBA: 2
NumberOfPartitionEntries: 9
SizeOfPartitionEntry: 0x80 bytes
PartitionEntryArrayCRC32: 0x59d5ed6a (should be 0x59d5ed6a)

PART TYPE          STARTLBA   ENDLBA     ATTR NAME
0    EFI_SYSTEM    256        524543     0    loader
1    EFI_USR       524544     20955102   0    zfs
2    EFI_UNUSED
3    EFI_UNUSED
4    EFI_UNUSED
5    EFI_UNUSED
6    EFI_UNUSED
7    EFI_UNUSED
8    EFI_RESERVED  20955103   20971486   0
Related issues
Related to Feature #14469: nvme could raise dynamic lun expansion sysevents
Related to Feature #14471: vioblk could raise dynamic lun expansion sysevents
Related to Bug #14593: zpool online -e hangs forever after 14022
Related to Feature #14600: dkio(4I) needs an ioctl to re-read the disk label
Updated by Toomas Soome 9 months ago
Andy Fiddaman wrote:
> Attempting to expand a ZFS pool in one operation using zpool online -e does not work, causing pool access (and the machine) to hang. If the partition is expanded first using format, however, then the expansion succeeds.
EFI wholedisk or custom partitioning?
Updated by Andy Fiddaman 9 months ago
Toomas Soome wrote in #note-1:
> EFI wholedisk or custom partitioning?

I've expanded the description. The reproducer I'm using is an EFI whole disk, so that zpool online -e tries to expand it.
Updated by Andy Fiddaman 9 months ago
The pool ends up being suspended because ldi_get_size() returns -1 when called from vdev_disk_open() in the re-open case:
dtrace -n 'fbt::ldi_get_size:entry{v = (struct ldi_handle *)arg0; print(*v); self->t++}' \
    -n 'fbt::ldi*:entry/self->t/{}' \
    -n 'fbt::ldi*:return/self->t/{trace(arg1)}' -F

CPU FUNCTION
  2  -> ldi_get_size
struct ldi_handle {
    struct ldi_handle *lh_next = 0
    uint_t lh_ref = 0x1
    uint_t lh_flags = 0
    uint_t lh_type = 0x2
    struct ldi_ident *lh_ident = 0xfffffe064a4abbf8
    vnode_t *lh_vp = 0xfffffe064a4c3a40
    kmutex_t [1] lh_lock = [
        kmutex_t {
            void *[1] _opaque = [ 0 ]
        }
    ]
    struct ldi_event *lh_events = 0
}
  2    | ldi_get_size:entry
  2    -> ldi_get_otyp
  2    <- ldi_get_otyp           0
  2    -> ldi_prop_exists
  2    <- ldi_prop_exists        0
  2    -> ldi_prop_exists
  2    <- ldi_prop_exists        0
  2    -> ldi_prop_exists
  2    <- ldi_prop_exists        0
  2    -> ldi_prop_exists
  2    <- ldi_prop_exists        0
  2  <- ldi_get_size             4294967295
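For context, the consumer-side check that trips is roughly the following. This is a minimal sketch modelled on vdev_disk_open() in usr/src/uts/common/fs/zfs/vdev_disk.c, not the verbatim source: ldi_get_size() answers via the LDI "Size"/"Nblocks" properties probed above, and when those lookups fail it returns DDI_FAILURE (-1, shown as 4294967295 in the trace), which vdev_disk_open() turns into the EINVAL (22) seen later.

	/*
	 * Sketch of the failing size check in vdev_disk_open()
	 * (simplified; field names follow the illumos sources).
	 */
	uint64_t capacity;

	if (ldi_get_size(dvd->vd_lh, &capacity) != DDI_SUCCESS) {
		vdev_dbgmsg(vd, "vdev_disk_open: failed to get size");
		return (SET_ERROR(EINVAL));	/* errno 22 in the traces */
	}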
Updated by Andy Fiddaman 3 months ago
What is happening here is that during an online expansion of an EFI disk, libzfs:zpool_vdev_online() calls libzfs:zpool_relabel_disk(), which in turn calls libefi:efi_use_whole_disk(). efi_use_whole_disk() uses the DKIOCSETEFI ioctl to write the new GPT table header, which ends up calling cmlb_dkio_set_efi(). After writing the table, this function sets cl_f_geometry_is_valid = B_FALSE; on the LUN.
When libzfs then continues and tries to re-open the device, there is a race where it can attempt to read the device size while the geometry is still marked as invalid. Adding a small pause here makes the resize work most of the time.
  0  -> vdev_reopen
  0    -> vdev_close
  0    <- vdev_close              1
  0    -> vdev_open
  0      -> vdev_disk_open
  0        -> ldi_get_size
  0          -> ldi_prop_exists
  0            -> cmlb_prop_op
  0              -> cmlb_is_valid
  0              <- cmlb_is_valid 0    <- due to cl_f_geometry_is_valid being false
  0            <- cmlb_prop_op    1
  0          <- ldi_prop_exists   0
  0        <- ldi_get_size        4294967295
  0        | zfs_dbgmsg:zfs-dbgmsg disk vdev '/dev/dsk/c1t0d0s1': vdev_disk_open: failed to get size
  0      <- vdev_disk_open        22
  0      -> vdev_set_state
  0        -> vdev_is_dead
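To make the sequence concrete, here is a heavily simplified sketch of the two kernel pieces involved, modelled on cmlb_dkio_set_efi() and cmlb_prop_op() in usr/src/uts/common/io/cmlb.c. The function and field names come from the traces above; the bodies are illustrative only, not the verbatim source.

	/*
	 * Sketch only: what DKIOCSETEFI leaves behind, and why the
	 * subsequent size lookup then fails.
	 */
	static int
	cmlb_dkio_set_efi(struct cmlb_lun *cl, ...)
	{
		/* ... copy in and write the new GPT header/entries ... */

		/*
		 * The cached label no longer matches what is on disk,
		 * so mark the geometry invalid; nothing on this path
		 * revalidates it afterwards.
		 */
		cl->cl_f_geometry_is_valid = B_FALSE;
		return (0);
	}

	static int
	cmlb_prop_op(struct cmlb_lun *cl, ...)
	{
		/*
		 * "Size"/"Nblocks" property lookups bail out while the
		 * geometry is invalid (DDI_PROP_NOT_FOUND == 1 in the
		 * trace), so ldi_get_size() gets no answer and returns
		 * DDI_FAILURE (-1) to vdev_disk_open().
		 */
		if (!cmlb_is_valid(cl))
			return (DDI_PROP_NOT_FOUND);
		/* ... normal property handling ... */
	}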
Updated by Joshua M. Clulow 3 months ago
Perhaps rather than a small pause we could adjust things so that we can detect this condition explicitly and retry until it resolves?
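For illustration, inside vdev_disk_open() that might look something like the fragment below. This is purely hypothetical (the retry bound and delay are invented, and the integrated fix ultimately took a different approach):

	/*
	 * Hypothetical sketch of a bounded detect-and-retry around the
	 * size query; delay(9F)/drv_usectohz(9F) are standard DDI calls.
	 */
	uint64_t capacity;
	int tries = 0;

	while (ldi_get_size(dvd->vd_lh, &capacity) != DDI_SUCCESS) {
		if (++tries > 100)
			return (SET_ERROR(EINVAL));
		delay(drv_usectohz(10000));	/* back off for 10ms */
	}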
Updated by Andy Fiddaman 3 months ago
The small pause was just a diagnostic (dtrace chill) to try and isolate the problem. I have a proposed fix that is at least in the right area.
Updated by Andy Fiddaman 3 months ago
I've tested the patch in the associated gerrit review on an OmniOS VM running under bhyve. Prior to the fix, the pool faulted 100% of the time across a run of 20 attempts; with the patch applied, I had 20 consecutive successful runs. Here is some output from one of them:
root@test:~# zpool list
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
rpool  7.50G  2.15G  5.35G        -        3G     2%    28%  1.00x  ONLINE  -
root@test:~# zpool status
  pool: rpool
 state: ONLINE
  scan: none requested
config:

        NAME      STATE     READ WRITE CKSUM
        rpool     ONLINE       0     0     0
          c1t0d0  ONLINE       0     0     0

errors: No known data errors
root@test:~# prtvtoc -s /dev/rdsk/c1t0d0
*                           First      Sector     Last
* Partition  Tag  Flags     Sector     Count      Sector  Mount Directory
       0     12    00          256    524288     524543
       1      4    00       524544  16236255   16760798
       8     11    00     16760799     16384   16777182
root@test:~# zpool online -e rpool c1t0d0
root@test:~# zpool list
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
rpool  10.5G  2.15G  8.35G        -         -     1%    20%  1.00x  ONLINE  -
root@test:~# prtvtoc -s /dev/rdsk/c1t0d0
*                           First      Sector     Last
* Partition  Tag  Flags     Sector     Count      Sector  Mount Directory
       0     12    00          256    524288     524543
       1      4    00       524544  22527711   23052254
       8     11    00     23052255     16384   23068638
Using dtrace, I can see that the geometry is being revalidated as a result of the new ioctl() call:
  0  47448            cmlb_dkio_set_efi:entry
  0  47522                   cmlb_ioctl:entry DKIOCSTATE
  0  47418       cmlb_validate_geometry:entry
              cmlb`cmlb_ioctl+0x9d
              blkdev`bd_ioctl+0xc5
              genunix`cdev_ioctl+0x3f
              specfs`spec_ioctl+0x55
              genunix`fop_ioctl+0x40
              genunix`ioctl+0x144
              unix`sys_syscall+0x2ae

  0   1572                   zfs_dbgmsg:zfs-dbgmsg spa=rpool async request task=1
This is the script I used here:
#!/usr/sbin/dtrace -Cs

zfs-dbgmsg
{
	printf("%s\n", stringof(arg0))
}

fbt::cmlb_ioctl:entry
/arg2 == 0x40d/
{
	self->t++;
	trace("DKIOCSTATE")
}

fbt::cmlb_ioctl:return
/self->t/
{
	self->t--;
}

fbt::cmlb_validate_geometry:entry
/self->t/
{
	stack()
}

fbt::cmlb_dkio_set_efi:entry
{}
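Based on the trace, the shape of the fix is to poke the device with an ioctl once the label has been rewritten, so that cmlb revalidates the geometry before the vdev is reopened. Below is a hedged userland sketch of that idea, not the literal patch from the gerrit review; DKIOCSTATE is the ioctl seen firing in the trace above (0x40d in the script's predicate is its value), and the helper name is invented for illustration:

	/*
	 * Sketch of the idea behind the fix: after DKIOCSETEFI rewrites
	 * the GPT (leaving the cached geometry marked invalid), issue
	 * DKIOCSTATE so that cmlb runs cmlb_validate_geometry() before
	 * anything asks for the device size again.
	 */
	#include <sys/types.h>
	#include <sys/dkio.h>
	#include <unistd.h>

	static int
	write_gpt_and_revalidate(int fd, dk_efi_t *dk_ioc)
	{
		enum dkio_state state = DKIO_NONE;

		if (ioctl(fd, DKIOCSETEFI, (void *)dk_ioc) == -1)
			return (-1);

		/*
		 * Passing DKIO_NONE should return immediately with the
		 * current media state; as a side effect, cmlb
		 * revalidates the on-disk label.
		 */
		if (ioctl(fd, DKIOCSTATE, &state) == -1)
			return (-1);

		return (0);
	}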
Updated by Andy Fiddaman 3 months ago
- Related to Feature #14469: nvme could raise dynamic lun expansion sysevents added
Updated by Andy Fiddaman 3 months ago
For completeness, I also checked that there is no regression around legacy pools on MBR-partitioned drives:
omnios@bhyvetest:~$ zpool status
  pool: rpool
 state: ONLINE
  scan: none requested
config:

        NAME                       STATE     READ WRITE CKSUM
        rpool                      ONLINE       0     0     0
          c1t589CFC2045C20001d0s0  ONLINE       0     0     0

errors: No known data errors
omnios@bhyvetest:~$ pfexec fdisk -W - c1t589CFC2045C20001d0p0 | tail -5
* Id    Act  Bhead  Bsect  Bcyl    Ehead  Esect  Ecyl    Rsect      Numsect
  191   128  65     2      0       255    63     1023    4096       16773120
  0     0    0      0      0       0      0      0       0          0
  0     0    0      0      0       0      0      0       0          0
  0     0    0      0      0       0      0      0       0          0
omnios@bhyvetest:~$ zpool list
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
rpool  7.50G   767M  6.75G        -         -     0%     9%  1.00x  ONLINE  -
omnios@bhyvetest:~$ pfexec /usr/lib/sysevent/sysevent -r
nvlist version: 0
        date = 2022-02-18T09:54:22.558Z
        type = ready

... at this point, I resized the underlying ZVOL from outside the VM and checked that the sysevent was received.

nvlist version: 0
        date = 2022-02-18T09:54:52.108Z
        type = event
        vendor = SUNW
        publisher = nvme
        class = EC_dev_status
        subclass = ESC_dev_dle
        pid = 0
        data = (embedded nvlist)
        nvlist version: 0
                phys_path = /devices/pci@0,0/pcifb5d,a0a@4/blkdev@w589CFC2045C20001,0:a
        (end data)

... and, as expected, no change on the pool:

root@bhyvetest:~# zpool list
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
rpool  7.50G   767M  6.75G        -         -     0%     9%  1.00x  ONLINE  -
root@bhyvetest:~# zpool online -e rpool c1t589CFC2045C20001d0s0
root@bhyvetest:~# zpool list
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
rpool  7.50G   767M  6.75G        -         -     0%     9%  1.00x  ONLINE  -
Updated by Andy Fiddaman 3 months ago
and here is a test with a pool on a GPT-partitioned disk, first without a UEFI boot partition:
root@bhyvetest:~# zpool create bob c2t0d0
root@bhyvetest:~# prtvtoc -s /dev/rdsk/c2t0d0p0
*                           First      Sector     Last
* Partition  Tag  Flags     Sector     Count      Sector  Mount Directory
       0      4    00          256  20954847   20955102
       8     11    00     20955103     16384   20971486
root@bhyvetest:~# zpool list bob
NAME   SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
bob   9.50G  1.12M  9.50G        -         -     0%     0%  1.00x  ONLINE  -
root@bhyvetest:~# zdb | grep phys_path
            phys_path: '/pci@0,0/pci1af4,2@5/blkdev@0,0:a'    <-- note 'a'

... volume resize, outside the VM

root@bhyvetest:~# zpool list bob
NAME   SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
bob   9.50G  1.12M  9.50G        -       11G     0%     0%  1.00x  ONLINE  -
root@bhyvetest:~# zpool online -e bob c2t0d0
root@bhyvetest:~# zpool list bob
NAME   SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
bob   20.5G  1.38M  20.5G        -         -     0%     0%  1.00x  ONLINE  -
and with a UEFI boot partition:
root@bhyvetest:~# zpool create -B bob c2t0d0
root@bhyvetest:~# prtvtoc -s /dev/rdsk/c2t0d0p0
*                           First      Sector     Last
* Partition  Tag  Flags     Sector     Count      Sector  Mount Directory
       0     12    00          256    524288     524543
       1      4    00       524544  43499231   44023774
       8     11    00     44023775     16384   44040158
root@bhyvetest:~# zpool list bob
NAME   SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
bob   20.5G   744K  20.5G        -        1G     0%     0%  1.00x  ONLINE  -
root@bhyvetest:~# zdb | grep phys_path
            phys_path: '/pci@0,0/pci1af4,2@5/blkdev@0,0:b'    <-- note 'b'

... volume resize, outside the VM

root@bhyvetest:~# zpool online -e bob c2t0d0
root@bhyvetest:~# zpool list bob
NAME   SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
bob   21.5G  1.10M  21.5G        -         -     0%     0%  1.00x  ONLINE  -
Updated by Andy Fiddaman 3 months ago
- Related to Feature #14471: vioblk could raise dynamic lun expansion sysevents added
Updated by Electric Monk 3 months ago
- Status changed from New to Closed
- % Done changed from 0 to 100
git commit c2cd3a449cfa117e3a164f66931fa6c26c762945
commit c2cd3a449cfa117e3a164f66931fa6c26c762945
Author: Andy Fiddaman <omnios@citrus-it.co.uk>
Date:   2022-02-18T18:29:52.000Z

    14022 zpool online -e breaks access to pool
    Reviewed by: C Fraire <cfraire@me.com>
    Reviewed by: Jason King <jason.brian.king+illumos@gmail.com>
    Approved by: Dan McDonald <danmcd@joyent.com>
Updated by Joshua M. Clulow about 2 months ago
- Related to Bug #14593: zpool online -e hangs forever after 14022 added
Updated by Joshua M. Clulow about 2 months ago
- Related to Feature #14600: dkio(4I) needs an ioctl to re-read the disk label added