panic in metaslab_init due to space_map_open returning ENXIO
I managed to hit the following assertion:
assertion failed: space_map_open(&msp->ms_sm, mos, object, msp->ms_start, msp->ms_size, vd->vdev_ashift, &msp->ms_lock) == 0 (0x6 == 0x0), file: ../../common/fs/zfs/metaslab.c, line: 1408
With the following stack trace:
metaslab_init+0x1bb(ffffff04fbbb3980, 2, 43, 0)
spa_load_impl+0xa2f(ffffff051135d000, 7fad683ba845ba11, ffffff04f319f018, 3, 0, 1)
spa_load+0x14e(ffffff051135d000, 3, 0, 1)
zfsdev_ioctl+0x4a7(1900000000, 5a06, 80423ec, 100003, ffffff04f04b4188, ffffff001e91fe68)
cdev_ioctl+0x39(1900000000, 5a06, 80423ec, 100003, ffffff04f04b4188, ffffff001e91fe68)
spec_ioctl+0x60(ffffff04ecbdd140, 5a06, 80423ec, 100003, ffffff04f04b4188, ffffff001e91fe68)
fop_ioctl+0x55(ffffff04ecbdd140, 5a06, 80423ec, 100003, ffffff04f04b4188, ffffff001e91fe68)
ioctl+0x9b(3, 5a06, 80423ec)
I will post a fix to better handle this error case soon.
Updated by George Wilson over 6 years ago
This problem can occur if an invalid or stale pool configuration is passed to the kernel as part of an import attempt. The code today assumes, incorrectly, that the configuration has already been deemed valid by the time vdev_load() is invoked. A configuration passed to the kernel can contain extra vdevs, for example vdevs that came from an old pool that once existed but has since been recreated. The example from the crash dump:
ADDR STATE NAME
ffffff051135d000 ACTIVE $import
ADDR STATE AUX DESCRIPTION
ffffff0511971900 CANT_OPEN CORRUPT_DATA root
ffffff04ec444380 CANT_OPEN CORRUPT_DATA missing
ffffff04f7fcd680 CANT_OPEN CORRUPT_DATA missing
ffffff04f7fce940 CANT_OPEN CORRUPT_DATA missing
ffffff04f1a34640 CANT_OPEN CORRUPT_DATA missing
ffffff0510a28d40 CANT_OPEN CORRUPT_DATA missing
ffffff0510a280c0 HEALTHY - /dev/dsk/c2t5d0s0
ffffff05ee585940 HEALTHY - missing
ffffff05ee585300 HEALTHY - /dev/dsk/c2t8d0s0
Clearly, this pool configuration is a mess, but the import was able to get far enough to call vdev_load(), where it starts to load the metaslab associated with c2t5d0s0. This ends up trying to load the space_map for object 0x43, but that object doesn't exist, resulting in a panic. A similar panic would occur if the object did exist but had a checksum error (again, because it's stale).
In most cases we would expect vdev_load() to just succeed, as seen in the example below:
# zpool import
action: The pool can be imported using its name or numeric identifier.
status: One or more devices are missing from the system.
action: The pool cannot be imported. Attach the missing
devices and try again.
data UNAVAIL missing device
Additional devices are known to be part of this pool, though their
exact configuration cannot be determined.
Nonetheless, it's not guaranteed that we will be able to load the space_map, or even the metaslab array, so we must handle these errors and bubble them up to vdev_load() so that it can set the pool state accordingly.
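To illustrate the idea, here is a minimal, self-contained sketch of that error-propagation pattern. It is not the actual fix: every `sketch_*` name, the simplified structures, and the pretend object store (where only object numbers below `mos_object_count` exist) are assumptions made up for this example. It only shows an init routine returning the error from the space map open rather than asserting, and the caller marking the vdev as unopenable.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins; these are NOT the real ZFS types. */
typedef enum { SKETCH_VDEV_HEALTHY, SKETCH_VDEV_CANT_OPEN } sketch_vdev_state_t;
typedef enum { SKETCH_AUX_NONE, SKETCH_AUX_CORRUPT_DATA } sketch_vdev_aux_t;

typedef struct {
	sketch_vdev_state_t state;
	sketch_vdev_aux_t aux;
	const uint64_t *sm_objects;	/* space map object number per metaslab */
	size_t ms_count;		/* number of metaslabs */
} sketch_vdev_t;

/*
 * Pretend object store (assumption for this sketch): an object exists only
 * if its number is below mos_object_count. A missing object yields ENXIO.
 */
static int
sketch_space_map_open(uint64_t object, uint64_t mos_object_count)
{
	return (object < mos_object_count ? 0 : ENXIO);
}

/* Return the error to the caller instead of asserting that the open worked. */
static int
sketch_metaslab_init(uint64_t object, uint64_t mos_object_count)
{
	/* In this sketch, object 0 means no space map has been allocated. */
	if (object == 0)
		return (0);
	return (sketch_space_map_open(object, mos_object_count));
}

/* On the first failure, record the vdev state and pass the error up. */
int
sketch_vdev_load(sketch_vdev_t *vd, uint64_t mos_object_count)
{
	assert(vd != NULL);
	for (size_t m = 0; m < vd->ms_count; m++) {
		int error = sketch_metaslab_init(vd->sm_objects[m],
		    mos_object_count);
		if (error != 0) {
			vd->state = SKETCH_VDEV_CANT_OPEN;
			vd->aux = SKETCH_AUX_CORRUPT_DATA;
			return (error);
		}
	}
	return (0);
}
```

With a stale configuration that references object 0x43 in a store that has only 0x40 objects, `sketch_vdev_load()` returns ENXIO and leaves the vdev in the CANT_OPEN / CORRUPT_DATA state, so the import can fail cleanly instead of panicking.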
Updated by Electric Monk over 6 years ago
- Status changed from New to Closed
- % Done changed from 0 to 100
commit 1e9bd7ec42f2d3bf854c2da35310901194833267
Author: Prakash Surya <firstname.lastname@example.org>
Date: 2014-11-23T03:09:00.000Z

5213 panic in metaslab_init due to space_map_open returning ENXIO
Reviewed by: Matthew Ahrens <email@example.com>
Reviewed by: George Wilson <firstname.lastname@example.org>
Reviewed by: Bayard Bell <Bayard.Bell@nexenta.com>
Reviewed by: Brian Behlendorf <email@example.com>
Reviewed by: Steven Hartland <firstname.lastname@example.org>
Approved by: Dan McDonald <email@example.com>