Bug #4454
ldi notifications trigger vdev_disk_free() without spa_config_lock()
Status: Closed
% Done: 100%
Description
System Description
I am running a custom-compiled SmartOS derivative (based on illumos-joyent 122a8174b71285518bd85f331726b936b11de455, 2013/12/02)
on two Supermicro servers with X9DR7-TF+ boards, each with 6 Supermicro SAS disk shelves (SC826BE26-R1K28LP).
There is no SAS connection between the two systems.
Configuration
Every disk shelf is connected dual-path to an LSI 9207e HBA controller;
the shelves are SAS-chained from top to bottom for the first HBA and from bottom to top for the second HBA,
so every disk in every chassis is connected via two paths. We use Seagate 2TB Constellation.ES3 disks and
2 STEC ZeusRAM devices for the ZIL. The systems also export and import LUNs via FC.
To achieve shelf redundancy in case one disk shelf dies completely,
we arranged a ZFS pool with 9 RAIDZ2 vdevs of 6 disks each,
with every disk of a given RAIDZ2 vdev located in a different shelf.
There are also spare disks in the shelves.
Test
To test this configuration, we switched off one disk shelf in the middle of both SAS chains:
multipathing degraded for all disks, every vdev degraded, and the pool continued to work, using the spare disks in
the remaining 5 online disk shelves.
This worked on the first of the two identical systems; on the second system a kernel panic occurred,
and the system was not able to write a dump (see picture).
When we repeated the shelf shutdown on the second system after a reboot, with the same disk shelf,
everything worked as expected.
Thank you for your support!
Updated by Garrett D'Amore over 9 years ago
The backtrace is not for stock illumos.
However, my first guess is that this is coming about as a result of strategy() returning non-zero. That shouldn't be possible with stock illumos using the sd driver. (I've just finished tracing through the code paths on this myself.)
Perhaps SmartOS has a modified version of ZFS (it must!), or perhaps you have some other vdev present in the system besides sd? (You're not layering ZFS on top of itself with volumes, are you?)
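For reference, the check that would fire here is ZFS treating ldi_strategy()'s return value as something that cannot fail; the OmniOS dump later in this ticket shows the assertion text "ldi_strategy(dvd->vd_lh, bp) == 0". A rough sketch of the shape of that code path (a sketch only, not the exact SmartOS/OmniOS vdev_disk.c of that era, with the buf setup elided):

/*
 * Sketch only: names follow vdev_disk.c conventions and the buf setup is
 * elided.  The key point is that the return of ldi_strategy() is verified
 * rather than handled: VERIFY() remains active in non-DEBUG kernels, so a
 * non-zero return panics instead of becoming an I/O error on the zio.
 */
static void
vdev_disk_io_start_sketch(zio_t *zio)
{
        vdev_t *vd = zio->io_vd;
        vdev_disk_t *dvd = vd->vdev_tsd;
        vdev_buf_t *vb = kmem_alloc(sizeof (vdev_buf_t), KM_SLEEP);
        buf_t *bp = &vb->vb_buf;

        bioinit(bp);
        /* ... b_flags, b_bcount, b_un.b_addr, b_lblkno, b_iodone set up ... */

        VERIFY(ldi_strategy(dvd->vd_lh, bp) == 0);
}

So for the panic to occur, either ldi_strategy() itself rejected the call (EINVAL/ENOTSUP) or the underlying driver's strategy(9E) routine returned non-zero, which is the possibility Garrett is questioning.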
Updated by Franz Schober over 9 years ago
On the disk pool consisting of the dual-path SAS disks there are ZVOLs, exported as COMSTAR LUNs via Fibre Channel (emlxs in target mode).
These LUNs are imported via fcp as devices (emlxs in initiator mode), and a zpool mirror is built from one LUN exported by the local target and one from the remote system.
So there are two layers of ZFS in the system. All disk devices (SAS or FC) should be presented by sd.
zpool list -v
norddisk                                   97,9T  2,85T  95,0T      -     2%  1.00x  ONLINE  -
  raidz2                                   10,9T   325G  10,6T      -
    c0t5000C500575082EFd0                      -      -      -      -
    c0t5000C50057690CFFd0                      -      -      -      -
    c0t5000C500574FA357d0                      -      -      -      -
    c0t5000C5005745196Fd0                      -      -      -      -
    c0t5000C5005750511Fd0                      -      -      -      -
    c0t5000C50057508AF7d0                      -      -      -      -
  raidz2                                   10,9T   324G  10,6T      -
    c0t5000C50057500113d0                      -      -      -      -
    c0t5000C500574FD207d0                      -      -      -      -
    c0t5000C50057502437d0                      -      -      -      -
    c0t5000C500576AD97Bd0                      -      -      -      -
    c0t5000C500576B062Bd0                      -      -      -      -
    c0t5000C500576AF7A3d0                      -      -      -      -
  raidz2                                   10,9T   324G  10,6T      -
    c0t5000C50057507E87d0                      -      -      -      -
    c0t5000C500574FFE7Fd0                      -      -      -      -
    c0t5000C500576B0967d0                      -      -      -      -
    c0t5000C5005743D637d0                      -      -      -      -
    c0t5000C500575075D3d0                      -      -      -      -
    c0t5000C5005750259Bd0                      -      -      -      -
  raidz2                                   10,9T   324G  10,6T      -
    c0t5000C50057505FBBd0                      -      -      -      -
    c0t5000C500576B06B3d0                      -      -      -      -
    c0t5000C500576ADB2Bd0                      -      -      -      -
    c0t5000C50057508F8Bd0                      -      -      -      -
    c0t5000C5005750752Fd0                      -      -      -      -
    c0t5000C500576B049Bd0                      -      -      -      -
  raidz2                                   10,9T   325G  10,6T      -
    c0t5000C500576AFFE3d0                      -      -      -      -
    c0t5000C5005743CD8Fd0                      -      -      -      -
    c0t5000C500574FFA8Fd0                      -      -      -      -
    c0t5000C50057435CCBd0                      -      -      -      -
    c0t5000C500576AF25Fd0                      -      -      -      -
    c0t5000C50057435867d0                      -      -      -      -
  raidz2                                   10,9T   324G  10,6T      -
    c0t5000C50057504723d0                      -      -      -      -
    c0t5000C50057412A77d0                      -      -      -      -
    c0t5000C500576B0733d0                      -      -      -      -
    c0t5000C5005750398Fd0                      -      -      -      -
    c0t5000C500574FCC2Bd0                      -      -      -      -
    c0t5000C500576B044Fd0                      -      -      -      -
  raidz2                                   10,9T   323G  10,6T      -
    c0t5000C500575070B3d0                      -      -      -      -
    c0t5000C500576B0BC3d0                      -      -      -      -
    c0t5000C50057500043d0                      -      -      -      -
    c0t5000C500575051F7d0                      -      -      -      -
    c0t5000C500576B0357d0                      -      -      -      -
    c0t5000C5005742392Bd0                      -      -      -      -
  raidz2                                   10,9T   325G  10,6T      -
    c0t5000C500574643C7d0                      -      -      -      -
    c0t5000C50057225793d0                      -      -      -      -
    c0t5000C50057423ED3d0                      -      -      -      -
    c0t5000C50057504D6Fd0                      -      -      -      -
    c0t5000C50057463553d0                      -      -      -      -
    c0t5000C50057689927d0                      -      -      -      -
  raidz2                                   10,9T   324G  10,6T      -
    c0t5000C50057503963d0                      -      -      -      -
    c0t5000C500576B04C3d0                      -      -      -      -
    c0t5000C500576B010Bd0                      -      -      -      -
    c0t5000C500574FBF1Bd0                      -      -      -      -
    c0t5000C50057464923d0                      -      -      -      -
    c0t5000C500576B0F6Bd0                      -      -      -      -
  mirror                                   7,44G  9,39M  7,43G      -
    c0t5000A72030089FCAd0                      -      -      -      -
    c0t5000A7203008BDFEd0                      -      -      -      -
nordp                                      31,2T   110G  31,1T      -     0%  1.00x  ONLINE  -
  mirror                                   31,2T   110G  31,1T      -
    c0t600144F09082C029000052CC68420010d0      -      -      -      -
    c0t600144F09082C021000052CC68740004d0      -      -      -      -

stmfadm list-lu -v
LU Name: 600144F09082C029000052CC68420010
    Operational Status: Online
    Provider Name     : sbd
    Alias             : /dev/zvol/rdsk/norddisk/nord1_nordp_vol1
    View Entry Count  : 1
    Data File         : /dev/zvol/rdsk/norddisk/nord1_nordp_vol1
    Meta File         : not set
    Size              : 34634616274944
    Block Size        : 512
    Management URL    : not set
    Vendor ID         : SUN
    Product ID        : COMSTAR
    Serial Num        : not set
    Write Protect     : Disabled
    Writeback Cache   : Enabled
    Access State      : Active
LU Name: 600144F09082C029000052CC75B60011
    Operational Status: Online
    Provider Name     : sbd
    Alias             : /dev/zvol/rdsk/norddisk/nord1_suedp_vol1
    View Entry Count  : 1
    Data File         : /dev/zvol/rdsk/norddisk/nord1_suedp_vol1
    Meta File         : not set
    Size              : 34634616274944
    Block Size        : 512
    Management URL    : not set
    Vendor ID         : SUN
    Product ID        : COMSTAR
    Serial Num        : not set
    Write Protect     : Disabled
    Writeback Cache   : Enabled
    Access State      : Active
The ldi_strategy() function in illumos-joyent driver_lyr.c looks just the same to me as in illumos-gate:
int
ldi_strategy(ldi_handle_t lh, struct buf *bp)
{
        struct ldi_handle *handlep = (struct ldi_handle *)lh;
        dev_t dev;

        if ((lh == NULL) || (bp == NULL))
                return (EINVAL);

        /* this entry point is only supported for cb devices */
        dev = handlep->lh_vp->v_rdev;
        if (!(handlep->lh_type & LH_CBDEV))
                return (ENOTSUP);

        bp->b_edev = dev;
        bp->b_dev = cmpdev(dev);
        return (bdev_strategy(bp));
}
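To connect this to the question of where a non-zero return could come from: ldi_strategy() returns whatever bdev_strategy() returns, and bdev_strategy() dispatches to the block driver's strategy(9E) entry point. A simplified sketch of that dispatch (the real bdev_strategy() in usr/src/uts/common/os/bio.c carries additional bookkeeping and checks):

/*
 * Simplified sketch of the strategy dispatch (not the verbatim bio.c code):
 * devopsp[] is the kernel's dev_ops table indexed by major number, and
 * cb_strategy is the driver's strategy(9E) routine, e.g. sd's.
 */
int
bdev_strategy_sketch(struct buf *bp)
{
        struct dev_ops *ops = devopsp[getmajor(bp->b_edev)];

        /* the driver's return value is what ldi_strategy() ultimately returns */
        return (ops->devo_cb_ops->cb_strategy(bp));
}

So with only sd-backed vdevs, the non-zero returns available here are ldi_strategy()'s own EINVAL/ENOTSUP checks, which matches Garrett's observation that stock sd should not produce a failing strategy() call.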
Updated by Dan McDonald over 8 years ago
This has also been discovered by an OmniOS customer. Same stack trace:
> $c
vpanic()
0xfffffffffba88618()
vdev_disk_io_start+0x138(ffffff22600c6b08)
zio_vdev_io_start+0xaa(ffffff22600c6b08)
zio_execute+0x88(ffffff22600c6b08)
zio_nowait+0x21(ffffff22600c6b08)
vdev_probe+0x123(ffffff222b004000, ffffff2262622448)
zio_vdev_io_done+0xd6(ffffff2262622448)
zio_execute+0x88(ffffff2262622448)
taskq_thread+0x2d0(ffffff222ae21280)
thread_start+8()
> ::status
debugging crash dump vmcore.0 (64-bit) from storage3
operating system: 5.11 omnios-8c08411 (i86pc)
image uuid: b511ecb6-3b42-60b9-a297-bb6e99304bc6
panic message: assertion failed: ldi_strategy(dvd->vd_lh, bp) == 0, file: ../../common/fs/zfs/vdev_disk.c, line: 816
dump content: kernel pages only
>
And OmniOS does have some of the SmartOS ZFS-for-zones enhancements.
Our customer was NOT using zones, however. They were running in the global zone only and serving out zvols via iSCSI.
Our customer's pools did have some faulty or removed drives, though:
> ::spa -v
ADDR                 STATE NAME
ffffff21e93a9000    ACTIVE rpool

    ADDR             STATE     AUX          DESCRIPTION
    ffffff21e9392a80 HEALTHY   -            root
    ffffff21e9392540 HEALTHY   -              mirror
    ffffff21e9392000 HEALTHY   -                /dev/dsk/c7t0d0s0
    ffffff21e938fac0 HEALTHY   -                /dev/dsk/c7t1d0s0
ffffff21e9842000    ACTIVE vol-0003ff20-5ed5-4a79-83c3-3e0940e74b31
    ffffff222acf0000 DEGRADED  -            root
    ffffff222b004a80 HEALTHY   -              raidz
    ffffff222b03ea80 HEALTHY   -                /dev/dsk/c4t500A0751037984FFd0s0
    ffffff222b001040 HEALTHY   -                /dev/dsk/c4t500A0751037985ACd0s0
    ffffff222b001580 HEALTHY   -                /dev/dsk/c4t500A0751037985B1d0s0
    ffffff222b001ac0 HEALTHY   -                /dev/dsk/c4t500A0751037985B5d0s0
    ffffff222b004000 HEALTHY   -                /dev/dsk/c4t500A0751037985B7d0s0
    ffffff222b004540 HEALTHY   -                /dev/dsk/c4t500A07510379854Bd0s0
    ffffff222b00d040 DEGRADED  -              raidz
    ffffff222b007040 HEALTHY   -                /dev/dsk/c4t500A07510379853Cd0s0
    ffffff222b007580 HEALTHY   -                /dev/dsk/c4t500A07510379856Cd0s0
    ffffff222b007ac0 HEALTHY   -                /dev/dsk/c4t500A07510379850Cd0s0
    ffffff222b00a000 HEALTHY   -                /dev/dsk/c4t500A07510379856Dd0s0
    ffffff222b00a540 REMOVED   -                /dev/dsk/c4t500A07510379854Dd0s0
    ffffff222b00aa80 HEALTHY   -                /dev/dsk/c4t500A07510379859Dd0s0
    ffffff222b015580 HEALTHY   -              raidz
    ffffff222b00d580 HEALTHY   -                /dev/dsk/c4t500A07510379855Ed0s0
    ffffff222b00dac0 HEALTHY   -                /dev/dsk/c4t500A075103798501d0s0
    ffffff222b012000 HEALTHY   -                /dev/dsk/c4t500A075103798502d0s0
    ffffff222b012540 HEALTHY   -                /dev/dsk/c4t500A075103798507d0s0
    ffffff222b012a80 HEALTHY   -                /dev/dsk/c4t500A075103798509d0s0
    ffffff222b015040 HEALTHY   -                /dev/dsk/c4t500A075103798530d0s0
    ffffff222b01bac0 HEALTHY   -              raidz
    ffffff222b015ac0 HEALTHY   -                /dev/dsk/c4t500A075103798540d0s0
    ffffff222b018000 HEALTHY   -                /dev/dsk/c4t500A075103798567d0s0
    ffffff222b018540 HEALTHY   -                /dev/dsk/c4t500A075103798569d0s0
    ffffff222b018a80 HEALTHY   -                /dev/dsk/c4t500A075103798571d0s0
    ffffff222b01b040 HEALTHY   -                /dev/dsk/c4t500A075103798599d0s0
    ffffff222b01b580 HEALTHY   -                /dev/dsk/c4t500A075103798562d0s0
>
And finally, they reported not having seen this problem until r151010s (vs. r151010j), but there are no ZFS changes between the two.
A look at the "r151010" branch of illumos-omnios will show exactly what is in that ZFS code.
I'm adding a Joyent engineer and a ZFS engineer to see if any of this looks familiar. My speculation (and it's only speculation) is that a faulty drive tickled a failure path that was assumed to be impossible.
Updated by Dan McDonald about 6 years ago
This bug popped up on the OmniOS mailing list today:
http://lists.omniti.com/pipermail/omnios-discuss/2017-March/008457.html
Updated by Mike Gerdts over 3 years ago
This bug is linked by https://smartos.org/bugview/OS-2743, which seems to be a dup of https://smartos.org/bugview/OS-7920. I think I have diagnosed the root cause and have a potential fix that is described in OS-7920.
Updated by Mike Gerdts over 3 years ago
- Status changed from New to In Progress
- Assignee set to Mike Gerdts
- Tags deleted (needs-triage)
Updated by Mike Gerdts over 3 years ago
- Subject changed from Kernel panic on switching off SAS disk shelf, assertion failed: ldi_strategy to ldi notifications trigger vdev_disk_free() without spa_config_lock()
Updated by Mike Gerdts over 3 years ago
Dan McDonald wrote:
This bug popped up on the OmniOS mailing list today:
http://lists.omniti.com/pipermail/omnios-discuss/2017-March/008457.html
That link is busted, but this one seems to describe the situation quite completely. https://web.archive.org/web/20140618152126/http://comments.gmane.org/gmane.os.illumos.zfs/3905
Updated by Mike Gerdts over 3 years ago
- Related to Bug #12076: reproducible panic in zfs:abd_free_linear added
Updated by Mike Gerdts over 3 years ago
- File OS-7920.tar.gz OS-7920.tar.gz added
Updated by Jason King over 3 years ago
For testing, the zfs test suite was run. The only failures were known failures with existing issues filed on them.
In addition, Mike performed the following testing, specific to this bug, using iSCSI:
Manual testing
I found that the easiest way to trigger this code path was with iSCSI: offlining an LU at the target can trigger it.
In the setup described below, the target and initiator are on two different systems. While it may be possible to have them on the same system, this could lead to same-kernel zfs-on-zfs deadlocks.
The iSCSI target and initiator settings do not survive a reboot on SmartOS. After a reboot, the svcadm, itadm, stmfadm, and iscsiadm setup commands need to be repeated. Once they have been executed, any pool that previously existed on the LUs (and was not destroyed) may be imported.
target setup
[root@cn2 (coal-1) ~]# svcadm enable -r iscsi/target
[root@cn2 (coal-1) ~]# itadm create-target
Target iqn.2010-08.org.illumos:02:55bbfe58-53aa-4921-bae5-eda0ea7e2f33 successfully created
[root@cn2 (coal-1) ~]# zfs create zones/zfstest
[root@cn2 (coal-1) ~]# zfs set compression=on zones/zfstest
[root@cn2 (coal-1) ~]# zfs create -V 1g zones/zfstest/lun0
[root@cn2 (coal-1) ~]# zfs create -V 1g zones/zfstest/lun1
[root@cn2 (coal-1) ~]# zfs create -V 1g zones/zfstest/lun2
[root@cn2 (coal-1) ~]# stmfadm create-lu /dev/zvol/rdsk/zones/zfstest/lun0
Logical unit created: 600144F00C9D700000005DF0EC0B0001
[root@cn2 (coal-1) ~]# stmfadm create-lu /dev/zvol/rdsk/zones/zfstest/lun1
Logical unit created: 600144F00C9D700000005DF0EC0D0002
[root@cn2 (coal-1) ~]# stmfadm create-lu /dev/zvol/rdsk/zones/zfstest/lun2
Logical unit created: 600144F00C9D700000005DF0EC0E0003
[root@cn2 (coal-1) ~]# stmfadm list-lu
LU Name: 600144F00C9D700000005DF0EC0B0001
LU Name: 600144F00C9D700000005DF0EC0D0002
LU Name: 600144F00C9D700000005DF0EC0E0003
[root@cn2 (coal-1) ~]# stmfadm online-lu 600144F00C9D700000005DF0EC0B0001
[root@cn2 (coal-1) ~]# stmfadm online-lu 600144F00C9D700000005DF0EC0D0002
[root@cn2 (coal-1) ~]# stmfadm online-lu 600144F00C9D700000005DF0EC0E0003
[root@cn2 (coal-1) ~]# stmfadm list-lu | awk '{print "stmfadm add-view", $NF}' | sh -ex
+ stmfadm add-view 600144F00C9D700000005DF0EC0B0001
+ stmfadm add-view 600144F00C9D700000005DF0EC0D0002
+ stmfadm add-view 600144F00C9D700000005DF0EC0E0003
[root@cn2 (coal-1) ~]# ifconfig -a
lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
inet 127.0.0.1 netmask ff000000
e1000g0: flags=1001004843<UP,BROADCAST,RUNNING,MULTICAST,DHCP,IPv4,FIXEDMTU> mtu 1500 index 2
inet 10.99.99.40 netmask ffffff00 broadcast 10.99.99.255
ether 52:54:0:48:bb:8c
lo0: flags=2002000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv6,VIRTUAL> mtu 8252 index 1
inet6 ::1/128
initiator setup
[root@cn3 (coal-1) ~]# svcadm enable -rs iscsi/initiator
[root@cn3 (coal-1) ~]# iscsiadm modify discovery -t enable
[root@cn3 (coal-1) ~]# iscsiadm add discovery-address 10.99.99.40
[root@cn3 (coal-1) ~]# devfsadm
Could not open /etc/saf/zsmon/_pmtab
[root@cn3 (coal-1) ~]# format
Searching for disks...done
AVAILABLE DISK SELECTIONS:
0. c0t600144F00C9D700000005DF0EC0B0001d0 <SUN-COMSTAR-1.0 cyl 1022 alt 2 hd 64 sec 32>
/scsi_vhci/disk@g600144f00c9d700000005df0ec0b0001
1. c0t600144F00C9D700000005DF0EC0D0002d0 <SUN-COMSTAR-1.0 cyl 1022 alt 2 hd 64 sec 32>
/scsi_vhci/disk@g600144f00c9d700000005df0ec0d0002
2. c0t600144F00C9D700000005DF0EC0E0003d0 <SUN-COMSTAR-1.0 cyl 1022 alt 2 hd 64 sec 32>
/scsi_vhci/disk@g600144f00c9d700000005df0ec0e0003
3. c1d0 <QEMU HARDDISK=QM00001-QM00001-0001-40.00GB>
/pci@0,0/pci-ide@1,1/ide@0/cmdk@0,0
4. c2t0d0 <Virtio-Block Device-0000-1.00GB>
/pci@0,0/pci1af4,2@9/blkdev@0,0
5. c3t0d0 <Virtio-Block Device-0000-1.00GB>
/pci@0,0/pci1af4,2@a/blkdev@0,0
Specify disk (enter its number): ^D
[root@cn3 (coal-1) ~]# zpool create iscsi \
mirror c0t600144F00C9D700000005DF0EC0B0001d0 c0t600144F00C9D700000005DF0EC0D0002d0 \
spare c0t600144F00C9D700000005DF0EC0E0003d0
Exercise on debug build
The initiator system is running the 20191211T012722Z debug build. I use four windows:
1. On the initiator CN, run ./watch-OS-7920 (the DTrace script included in the attached OS-7920.tar.gz)
2. On the initiator CN, run while true; do dd if=/dev/zero of=/iscsi/file bs=1024k count=100; done
3. On the initiator CN, run other commands
4. On the target CN, disable and enable an LU
Start the watcher in window 1 and the dd loop in window 2. Wait for dd to report stats at least once.
In window 4 (target system), offline one of the LUs that makes up the mirror.
[root@cn2 (coal-1) ~]# stmfadm offline-lu 600144F00C9D700000005DF0EC0D0002
In window 1, notice that the changed code path was hit:
[root@cn3 (coal-1) /var/tmp/OS-7920]# ./watch-OS-7920
dtrace: script './watch-OS-7920' matched 3 probes
CPU ID FUNCTION:NAME
0 20147 ldi_ev_get_type:return 2019 Dec 11 13:28:41 LDI:EVENT:OFFLINE
zfs`vdev_disk_off_notify+0x37
genunix`ldi_invoke_notify+0x11a
genunix`e_ddi_offline_notify+0x8b
genunix`devi_detach_node+0x89
genunix`ndi_devi_offline+0x13d
genunix`i_mdi_pi_state_change+0x5d9
genunix`mdi_pi_offline+0x3d
iscsi`iscsi_lun_offline+0xa6
iscsi`iscsi_lun_destroy+0x43
iscsi`iscsi_sess_reportluns+0x3cc
iscsi`iscsi_sess_enumeration+0xb6
genunix`taskq_thread+0x315
unix`thread_start+0xb
3 20147 ldi_ev_get_type:return 2019 Dec 11 13:28:43 LDI:EVENT:OFFLINE
zfs`vdev_disk_off_finalize+0x3a
genunix`ldi_invoke_finalize+0x132
genunix`e_ddi_offline_finalize+0x79
genunix`devi_detach_node+0x1d7
genunix`ndi_devi_offline+0x13d
genunix`i_mdi_pi_state_change+0x5d9
genunix`mdi_pi_offline+0x3d
iscsi`iscsi_lun_offline+0xa6
iscsi`iscsi_lun_destroy+0x43
iscsi`iscsi_sess_reportluns+0x3cc
iscsi`iscsi_sess_enumeration+0xb6
genunix`taskq_thread+0x315
unix`thread_start+0xb
In window 3, observe that the spare was added to the mirror.
[root@cn3 (coal-1) ~]# zpool status
pool: iscsi
state: DEGRADED
status: One or more devices has been removed by the administrator.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Online the device using 'zpool online' or replace the device with
'zpool replace'.
scan: resilvered 3.15M in 0 days 00:00:04 with 0 errors on Wed Dec 11 13:28:52 2019
config:
NAME STATE READ WRITE CKSUM
iscsi DEGRADED 0 0 0
mirror-0 DEGRADED 0 0 0
c0t600144F00C9D700000005DF0EC0B0001d0 ONLINE 0 0 0
spare-1 DEGRADED 0 0 0
c0t600144F00C9D700000005DF0EC0D0002d0 REMOVED 0 0 0
c0t600144F00C9D700000005DF0EC0E0003d0 ONLINE 0 0 0
spares
c0t600144F00C9D700000005DF0EC0E0003d0 INUSE currently in use
errors: No known data errors
...
In window 4, online the LU
[root@cn2 (coal-1) ~]# stmfadm online-lu 600144F00C9D700000005DF0EC0D0002
In window 3, observe that the pool returns to its initial configuration
[root@cn3 (coal-1) ~]# zpool status
pool: iscsi
state: ONLINE
scan: resilvered 3.15M in 0 days 00:00:04 with 0 errors on Wed Dec 11 13:28:52 2019
config:
NAME STATE READ WRITE CKSUM
iscsi ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
c0t600144F00C9D700000005DF0EC0B0001d0 ONLINE 0 0 0
spare-1 ONLINE 0 0 0
c0t600144F00C9D700000005DF0EC0D0002d0 ONLINE 0 0 0
c0t600144F00C9D700000005DF0EC0E0003d0 ONLINE 0 0 0
spares
c0t600144F00C9D700000005DF0EC0E0003d0 INUSE currently in use
errors: No known data errors
...
[root@cn3 (coal-1) ~]# zpool status
pool: iscsi
state: ONLINE
scan: resilvered 333K in 0 days 00:00:01 with 0 errors on Wed Dec 11 13:29:17 2019
config:
NAME STATE READ WRITE CKSUM
iscsi ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
c0t600144F00C9D700000005DF0EC0B0001d0 ONLINE 0 0 0
c0t600144F00C9D700000005DF0EC0D0002d0 ONLINE 0 0 0
spares
c0t600144F00C9D700000005DF0EC0E0003d0 AVAIL
errors: No known data errors
...
After doing this once manually, I put the offline/online into a loop:
[root@cn2 (coal-1) ~]# (set -ex; while true; do stmfadm offline-lu 600144F00C9D700000005DF0EC0D0002; sleep 15; stmfadm online-lu 600144F00C9D700000005DF0EC0D0002; sleep 45; done)
+ true
+ stmfadm offline-lu 600144F00C9D700000005DF0EC0D0002
+ sleep 15
+ stmfadm online-lu 600144F00C9D700000005DF0EC0D0002
+ sleep 45
+ true
+ stmfadm offline-lu 600144F00C9D700000005DF0EC0D0002
+ sleep 15
+ stmfadm online-lu 600144F00C9D700000005DF0EC0D0002
+ sleep 45
...
Also, I watched the pool cycle through its various stages of self-healing with:
[root@cn3 (coal-1) ~]# while true; do date; zpool status iscsi; sleep 10; done
Updated by Jason King over 3 years ago
From Joyent OS-7920:
There have been several occurrences of panics involving the following stack at about the same time as disk errors are logged. The first two arguments to ldi_ioctl vary. The first is always a garbage pointer and the second is a valid DKIO.
ffffd003d5e7b990 ldi_ioctl+0x2e(5ca96e63, 425, ffffd003d5e7b9e8, 80000000, ffffd0c4fcf1ae18, 0)
ffffd003d5e7ba60 vdev_disk_open+0x4f1(ffffd0c578140680, ffffd003d5e7ba98, ffffd003d5e7ba90, ffffd003d5e7ba88)
ffffd003d5e7bad0 vdev_open+0xdd(ffffd0c578140680)
ffffd003d5e7bb00 vdev_reopen+0x3b(ffffd0c578140680)
ffffd003d5e7bb40 spa_async_probe+0x8a(ffffd0c571f6a000, ffffd0c578140680)
ffffd003d5e7bb80 spa_async_probe+0x56(ffffd0c571f6a000, ffffd0c578142640)
ffffd003d5e7bbc0 spa_async_probe+0x56(ffffd0c571f6a000, ffffd0c5766a7d40)
ffffd003d5e7bc20 spa_async_thread+0x281(ffffd0c571f6a000)
ffffd003d5e7bc30 thread_start+8()
There are three calls to ldi_ioctl in vdev_disk_open:
335         vdev_disk_alloc(vd);
336         dvd = vd->vdev_tsd;
...
568         if ((error = ldi_ioctl(dvd->vd_lh, DKIOCGMEDIAINFOEXT,
...
573         } else if ((error = ldi_ioctl(dvd->vd_lh, DKIOCGMEDIAINFO,
...
581         } else {
...
586         }
...
590         if (vd->vdev_wholedisk == 1) {
...
607                 (void) ldi_ioctl(dvd->vd_lh, DKIOCSETWCE, (intptr_t)&wce,
608                     FKIOCTL, kcred, NULL);
609         }
In this case, we are calling ldi_ioctl(<bad_address>, DKIOCSETWCE, ...), the call at line 607. In order to get here, we've made at least one other call to ldi_ioctl(dvd->vd_lh, ...) with a valid address in dvd->vd_lh. This appears to be a use-after-free, which is supported by:
> ffffd0c578140680::print vdev_t vdev_tsd
vdev_tsd = 0
The only place that vdev_tsd is set to NULL is in vdev_disk_free, which is called only by:
1   172  usr/src/uts/common/fs/zfs/vdev_disk.c  <<vdev_disk_off_finalize>>  vdev_disk_free(vd);
2   325  usr/src/uts/common/fs/zfs/vdev_disk.c  <<vdev_disk_open>>          vdev_disk_free(vd);
3   653  usr/src/uts/common/fs/zfs/vdev_disk.c  <<vdev_disk_close>>         vdev_disk_free(vd);
vdev_disk_off_finalize is only called as an ldi callback. vdev_disk_open is not suspect, as a race involving itself would be blocked by the spa_config_lock taken higher in the stack. vdev_disk_close may be called via a vdev_disk_ops callback or via an ldi callback. The ldi callbacks are quite suspicious.
static ldi_ev_callback_t vdev_disk_off_callb = {
        .cb_vers = LDI_EV_CB_VERS,
        .cb_notify = vdev_disk_off_notify,
        .cb_finalize = vdev_disk_off_finalize
};
Neither of those callbacks calls spa_config_enter(), allowing their manipulation of the vdev_disk_t to race with vdev_disk_open().
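To make the race concrete, here is a sketch of the interleaving, followed by the shape a lock-honoring notify callback could take. The callback body is hypothetical, not the illumos implementation and not the committed fix; it only illustrates the invariant that whoever tears down vd->vdev_tsd must be serialized, via the spa_config_lock, with vdev_disk_open() and the I/O paths that dereference it.

/*
 * Racy interleaving implied by the dumps (illustrative):
 *
 *   spa_async_thread / zio taskq             LDI offline notification
 *   ------------------------------------     ----------------------------------
 *   spa_config_enter(spa, SCL_STATE, ...)
 *   vdev_disk_open(vd)
 *     dvd = vd->vdev_tsd            (valid)
 *     ldi_ioctl(dvd->vd_lh, ...)    (ok)
 *                                            vdev_disk_off_notify(..., vd, ...)
 *                                              vdev_disk_close(vd)
 *                                                vdev_disk_free(vd)
 *                                                  vd->vdev_tsd = NULL
 *     ldi_ioctl(dvd->vd_lh, DKIOCSETWCE, ...)   <-- stale dvd, use-after-free
 *
 * Hypothetical notify callback that honors the config lock (a sketch,
 * not the actual vdev_disk_off_notify() and not the committed change):
 */
static int
vdev_disk_off_notify(ldi_handle_t lh, ldi_ev_cookie_t ecookie, void *arg,
    void *ev_data)
{
        vdev_t *vd = arg;
        spa_t *spa = vd->vdev_spa;

        if (strcmp(ldi_ev_get_type(ecookie), LDI_EV_OFFLINE) != 0)
                return (LDI_EV_SUCCESS);

        /* Serialize teardown of vd->vdev_tsd against the open and I/O paths. */
        spa_config_enter(spa, SCL_STATE, vd, RW_WRITER);
        vdev_disk_close(vd);
        spa_config_exit(spa, SCL_STATE, vd);

        return (LDI_EV_SUCCESS);
}

Whether the config lock can safely be acquired from LDI notification context without deadlocking is exactly the kind of question the real fix has to answer; the sketch only shows the invariant being restored.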
Updated by Jason King over 3 years ago
Also:
Another one, however, I can confidently say is caused by this.
> ::status
debugging crash dump vmcore.7 (64-bit) from omni-2
operating system: 5.11 omni-physio-0-g59a9ce0e70 (i86pc)
build version: gfx-drm - heads/master-0-gbdc58b1-dirty
image uuid: b1d960bc-afbe-43c0-eea2-a975cb6312b2
panic message: BAD TRAP: type=e (#pf Page fault) rp=ffffff0008a99710 addr=ffffff094ae64778
dump content: kernel pages only
> $C
ffffff0008a99820 bdev_strategy+0x26(ffffff029ee775c0)
ffffff0008a99850 ldi_strategy+0x47(ffffff03086d9600, ffffff029ee775c0)
ffffff0008a998c0 vdev_disk_io_start+0x254(ffffff02733b45b8)
ffffff0008a99920 zio_vdev_io_start+0xdf(ffffff02733b45b8)
ffffff0008a99960 zio_execute+0xf5(ffffff02733b45b8)
ffffff0008a99990 zio_nowait+0x87(ffffff02733b45b8)
ffffff0008a999d0 vdev_queue_io_done+0x92(ffffff030f504648)
ffffff0008a99a10 zio_vdev_io_done+0xc8(ffffff030f504648)
ffffff0008a99a50 zio_execute+0xf5(ffffff030f504648)
ffffff0008a99b00 taskq_thread+0x315(ffffff03086e0ea0)
ffffff0008a99b10 thread_start+0xb()
>
This was purportedly fixed in OS-7797, but my workspace has that fix; it seems to have just tightened the race. Here we can see that the OS-7797 fix is present (a failure of ldi_strategy() now calls zio_interrupt()), yet we still ran into the disk being offline and vd_lh being NULL.
> vdev_disk_io_start+0x254::dis
...
vdev_disk_io_start+0x24f:       call   +0x442223c       <ldi_strategy>
vdev_disk_io_start+0x254:       testl  %eax,%eax
vdev_disk_io_start+0x256:       je     -0xea    <vdev_disk_io_start+0x172>
vdev_disk_io_start+0x25c:       jmp    -0x101   <vdev_disk_io_start+0x160>
...
> vdev_disk_io_start+0x160::dis
...
vdev_disk_io_start+0x160:       movl   $0x6,0x31c(%rbx)
vdev_disk_io_start+0x16a:       movq   %rbx,%rdi
vdev_disk_io_start+0x16d:       call   -0x14cb2 <zio_interrupt>
...
> ffffff02733b45b8::print zio_t io_vd | ::print vdev_t vdev_tsd | ::print vdev_disk_t
{
    vd_devid = 0
    vd_minor = 0
    vd_lh = 0
    vd_ldi_cbs = {
        list_size = 0x18
        list_offset = 0
        list_head = {
            list_next = 0xffffff026e2ad008
            list_prev = 0xffffff18277fc4b8
        }
    }
    vd_ldi_offline = 0x1 (B_TRUE)
}
That is:
static void
vdev_disk_io_start(zio_t *zio)
{
        vdev_t *vd = zio->io_vd;
        vdev_disk_t *dvd = vd->vdev_tsd;
        unsigned long trim_flags = 0;
        vdev_buf_t *vb;
        struct dk_callback *dkc;
        buf_t *bp;
        int error;

        /*
         * If the vdev is closed, it's likely in the REMOVED or FAULTED state.
         * Nothing to be done here but return failure.
         */
        if (dvd == NULL || (dvd->vd_ldi_offline && dvd->vd_lh == NULL)) {
                zio->io_error = ENXIO;
                zio_interrupt(zio);
                return;
        }
        // KAPOW! vdev_disk_off_notify() struck from another thread
Updated by Electric Monk over 3 years ago
- Status changed from In Progress to Closed
- % Done changed from 90 to 100
git commit 1b500975aaacf8b5d0e18c9a117bf5560069ffc3
commit 1b500975aaacf8b5d0e18c9a117bf5560069ffc3
Author: Mike Gerdts <mike.gerdts@joyent.com>
Date:   2020-01-31T17:43:17.000Z

    4454 ldi notifications trigger vdev_disk_free() without spa_config_lock()
    Portions contributed by: Jason King <jason.king@joyent.com>
    Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
    Approved by: Dan McDonald <danmcd@joyent.com>