Bug #12237

zpool coredump, vdev_open() returns EDOM with disk image file (but not zvol) from bhyve

Added by James Blachly 6 months ago. Updated 16 days ago.

Status: New
Priority: High
Assignee: -
Category: zfs - Zettabyte File System
Start date:
Due date:
% Done: 0%
Estimated time:
Difficulty: Medium
Tags:
Gerrit CR:

Description

Background:
I was unable to install OmniOS or OpenIndiana (I did not further attempt SmartOS, but presume the same) in a VM with a FreeBSD host and the bhyve hypervisor when using a disk image backed by a file. I was later successful when using a zvol-backed disk image. Some part of the kernel sees different properties for the two types of backing, even though the difference is not apparent in the diagnostics available to me.

When using the file-backed disk image, the automated installers errored out, and when I dropped to a shell to run zpool create manually, I saw the following:

root@openindiana:/root# diskinfo
TYPE    DISK                    VID      PID              SIZE          RMV SSD
-       c3t0d0                  Virtio   Block Device       20.00 GiB   no  no
root@openindiana:/root# zpool create rpool c3t0d0
internal error: Argument out of domain
Abort (core dumped)
root@openindiana:/root# mdb core
Loading modules: [ libumem.so.1 libc.so.1 libtopo.so.1 libavl.so.1 libnvpair.so.1 libsysevent.so.1 libuutil.so.1 libcmdutils.so.1 ld.so.1 ]
> ::status
debugging core file of zpool (32-bit) from openindiana
file: /sbin/zpool
initial argv: zpool create rpool c3t0d0
threading model: native threads
status: process terminated by SIGABRT (Abort), pid=1862 uid=0 code=-1
> ::stack
libc.so.1`_lwp_kill+0x15(1, 6, 8042198, fef49000, fef49000, 8093970)
libc.so.1`raise+0x2b(6)
libc.so.1`abort+0xed()
libzfs.so.1`zfs_verror+0xca(8093548, 827, fedcafa1, 804225c)
libzfs.so.1`zpool_standard_error_fmt+0x3a8(8093548, 21, fedcafa1)
libzfs.so.1`zpool_standard_error+0x27(8093548, 21, 80422c8)
libzfs.so.1`zpool_create+0x5a0(8093548, 8047ebd, 8090f88, 8090a28, 0)
zpool_do_create+0xaa1(3, 8047ddc)
main+0xd9(8047d6c, fef564a8)
_start_crt+0x97(4, 8047dd8, fefcfee4, 0, 0, 0)
_start+0x1a(4, 8047eb0, 8047eb6, 8047ebd, 8047ec3, 0)
Running dtrace against the create:

dtrace -n 'vdev_open:return { trace(arg1) }' -c "zpool create ..."

shows `vdev_open:return 33` (33 == EDOM), which suggests this block in vdev_open() is the culprit:

https://grok.elemental.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/vdev.c?r=4d7988d6#1689

HOWEVER, if I pass bhyve a disk image backed by a zvol, all works well in Illumos.

I have narrowed this down to a difference in the optimal block size perceived by ZFS for a virtio-blk device when backed [via bhyve] by a file versus a zvol. So I realize this is sort of a shared bug and needs to be reported to both projects. I'm afraid I lack the tools or know-how to do further analysis in a boot environment. I did not see any relevant kstat. `prtconf -v` output looks the same for the file-backed and zvol-backed devices, and both have a device-blksize property of 0x200.

Trying to attach the file-backed disk (c3t0d0 in this example) to the zvol-backed disk produces a warning that lends further support to the hypothesis that vdev_t->asize may be perceived differently.
(Here, c5t0d0 is the zvol-backed disk and c3t0d0 is the file-backed disk.)

root@openindiana:/root# zpool status
  pool: rpool
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          c5t0d0    ONLINE       0     0     0

errors: No known data errors
root@openindiana:/root# zpool attach rpool c5t0d0 c3t0d0
cannot attach c3t0d0 to c5t0d0: new device has a different optimal sector size; use the option '-o ashift=N' to override the optimal size

If the illumos community can help me figure out what is going on, I'll also be able to make a better bug report to the bhyve project.

Thanks in advance

History

#1

Updated by James Blachly 6 months ago

prtpicl -v may show the difference between the two.

The zvol-backed device has the property :lba-access-ok, while the file-backed disk does not.

#3

Updated by Hans Rosenfeld 5 months ago

Can you please report what version of FreeBSD you are running and what your bhyve VM configuration looks like?

#4

Updated by James Blachly 5 months ago

Hans:

12.1-RELEASE-p1

I am using churchers/vm-bhyve (https://github.com/churchers/vm-bhyve) to manage launching. The command-line parameters, taken from the logs, are as follows:

Jan 28 21:27:55:  [bhyve options: -c 1 -m 1G -Hwl bootrom,/usr/local/share/uefi-firmware/BHYVE_UEFI.fd -U ded534db-423e-11ea-91bf-002590f30d1a -u]
Jan 28 21:27:55:  [bhyve devices: -s 0,hostbridge -s 31,lpc -s 4:0,virtio-blk,/vm/oi-test/disk0.img -s 5:0,virtio-net,tap10,mac=58:9c:fc:00:02:e3]
Jan 28 21:27:55:  [bhyve console: -l com1,stdio]
Jan 28 21:27:55:  [bhyve iso device: -s 3:0,ahci-cd,/vm/.iso/OI-hipster-text-20191106.iso,ro]

As indicated before, this only happens when the VM is backed by a file. I'll also file a FreeBSD issue, although I am not enough of a storage expert to know whether they'll consider it an error not to set the LBA flag on a file-backed device.

#5

Updated by Hans Rosenfeld 5 months ago

Thanks, I was able to reproduce and debug this.

When the disk is backed by a file, bhyve reports a 128k physical block size, which results in ashift=17, larger than ASHIFT_MAX. There are really two issues here:

First, bhyve simply uses the "optimal I/O block size" returned by stat(2) as the physical block size. In this case that is the ZFS record size of the underlying file system, and I'm not sure bhyve should pass it through without any sanity checking. If the disk file is located on a different file system such as tmpfs, the resulting ashift changes to something more sensible like 12, thanks to the smaller record size. This needs to be fixed in bhyve.

Second, zpool(1M) should give a helpful error message for EDOM, explaining what is wrong.
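
To make the arithmetic concrete, here is a minimal user-space sketch of how a 128k physical block size becomes ashift=17 and trips a range check. The DEMO_ASHIFT_* limits and the helper function are illustrative stand-ins, not the actual illumos definitions:

#include <errno.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative limits only; the real names and values live in the ZFS headers. */
#define DEMO_ASHIFT_MIN 9   /* 2^9  = 512-byte sectors */
#define DEMO_ASHIFT_MAX 16  /* 2^16 = 64k sectors */

/* ashift is log2 of the sector size, i.e. highbit(size) - 1 for a power of two. */
static int
blocksize_to_ashift(uint64_t blksz)
{
	int shift = 0;

	while ((1ULL << shift) < blksz)
		shift++;
	return (shift);
}

int
main(void)
{
	/* 128k: the st_blksize bhyve sees for a file on ZFS with the default recordsize. */
	uint64_t physical = 128 * 1024;
	int ashift = blocksize_to_ashift(physical);

	(void) printf("physical block size %llu -> ashift %d\n",
	    (unsigned long long)physical, ashift);

	/* A range check along these lines is what makes vdev_open() fail with EDOM. */
	if (ashift < DEMO_ASHIFT_MIN || ashift > DEMO_ASHIFT_MAX) {
		(void) printf("ashift %d outside %d..%d: EDOM (%d)\n",
		    ashift, DEMO_ASHIFT_MIN, DEMO_ASHIFT_MAX, EDOM);
		return (1);
	}
	return (0);
}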

As a workaround you can manually create the pool and override the detected ashift value:

# zpool create -o ashift=12 rpool c3t0d0

#6

Updated by James Blachly 5 months ago

Thanks.

Do you have any other insight into the lack of the ":lba-access-ok" flag on the file-backed virtio-blk device? Clearly that was a red herring with respect to this particular bug, but now that it has been noted, is this something that should also be addressed in bhyve? (Relevant, since FreeBSD bhyve is now going to be upstream of a bhyve effort in illumos, I understand.)

Finally, is there any way, using the diagnostic tools available to the sysadmin (prtpicl, prtconf, etc.), to see the reported physical block size? When I issued prtconf -v, I am fairly certain it still reported 0x200 for the device block size (I can't recall the property name right now).

#7

Updated by James Blachly 5 months ago

The property `device-blksize` is evidently the logical block size, in this case again from the underlying filesystem: 0x200 (512 bytes).

It would be nice if illumos exposed the physical block size from the topology struct in virtioblk; otherwise there is probably no way this could have been debugged without being able to do whatever Hans did to see the data structure.
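
(For what it's worth, a small program poking the DKIOCGMEDIAINFOEXT ioctl on the raw device might show it. This is an untested sketch, and whether the virtio-blk/blkdev stack answers that ioctl with the 128k value is an assumption on my part:)

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/ioctl.h>
#include <sys/dkio.h>

int
main(int argc, char **argv)
{
	struct dk_minfo_ext mi;
	int fd;

	if (argc != 2) {
		(void) fprintf(stderr, "usage: %s /dev/rdsk/cXtYdZp0\n", argv[0]);
		return (1);
	}
	if ((fd = open(argv[1], O_RDONLY)) < 0) {
		perror("open");
		return (1);
	}
	/* Ask the disk driver for its extended media info, including physical block size. */
	if (ioctl(fd, DKIOCGMEDIAINFOEXT, &mi) < 0) {
		perror("DKIOCGMEDIAINFOEXT");
		(void) close(fd);
		return (1);
	}
	(void) printf("logical block size:  %u\n", mi.dki_lbsize);
	(void) printf("physical block size: %u\n", mi.dki_pbsize);
	(void) close(fd);
	return (0);
}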

#8

Updated by Andy Fiddaman 5 months ago

The OmniOS installer provides an option to force the ashift on a pool; I'd be interested if you could confirm that this lets you get it installed as a guest under these conditions. It's the 'Force sector size' option in the 'ZFS Root Pool Configuration' menu.

#9

Updated by Hans Rosenfeld 5 months ago

The lba-access-ok flag is really only an ancient piece of history.

It wouldn't be unreasonable to add block size reporting to diskinfo(1M). Perhaps you could file a feature request for that?

Regarding the issue in bhyve, I'm not sure I want to make any changes to its use of st_blksize (from stat(2)). ZFS is a bit special in that it can use really large record sizes, and that is intentional. Of course bhyve could substitute an arbitrary lower number if the value is too large, but which value should that be? It could be anything from 4k to 64k, and there isn't an obvious candidate that would be "right", or at least "least wrong". On other filesystems the value from st_blksize may or may not make sense, but on ZFS you really should always use a zvol with a sensible volblocksize for disk images.
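
The value bhyve picks up is easy to check from userland with a trivial stat(2) call against the backing file on the host. The sketch below is illustrative only; on a ZFS dataset with the default recordsize it will typically print 131072 for a large disk image:

#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>

int
main(int argc, char **argv)
{
	struct stat st;

	if (argc != 2) {
		(void) fprintf(stderr, "usage: %s <disk-image-file>\n", argv[0]);
		return (1);
	}
	if (stat(argv[1], &st) != 0) {
		perror("stat");
		return (1);
	}
	/*
	 * On a ZFS file system with the default 128k recordsize this is
	 * typically 131072, the value bhyve passes through as the virtio-blk
	 * physical block size; on tmpfs or UFS it is much smaller.
	 */
	(void) printf("st_blksize = %ld\n", (long)st.st_blksize);
	return (0);
}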

#10

Updated by Hans Rosenfeld 5 months ago

ZFS change for a better error message: https://code.illumos.org/c/illumos-gate/+/347

#11

Updated by Electric Monk 16 days ago

  • Gerrit CR set to 347
