Bug #4454

closed

ldi notifications trigger vdev_disk_free() without spa_config_lock()

Added by Franz Schober over 9 years ago. Updated over 3 years ago.

Status:
Closed
Priority:
High
Assignee:
Category:
zfs - Zettabyte File System
Start date:
2014-01-10
Due date:
% Done:

100%

Estimated time:
Difficulty:
Medium
Tags:
Gerrit CR:
External Bug:

Description

System Description

I am running a custom-compiled SmartOS derivative (based on illumos-joyent 122a8174b71285518bd85f331726b936b11de455, 2013/12/02)
on two Supermicro X9DR7-TF+ servers, each with 6 Supermicro SC826BE26-R1K28LP SAS disk shelves.
There is no SAS connection between the systems.

Configuration

Every disk shelf is connected dual-path to one LSI 9207e HBA controller;
the shelves are SAS-chained from top to bottom for the first HBA and from bottom to top for the second HBA,
so every disk in every chassis is connected dual-path. We use Seagate 2TB Constellation.ES3 disks and
2 STEC ZeusRAM devices for the ZIL. The systems also export and import LUNs via FC.

To achieve disk shelf redundancy in case one disk shelf dies completely,
we arranged a ZFS pool with 9 RAIDZ2 vdevs of 6 disks each,
with every disk of a given RAIDZ2 vdev located in a different shelf.
There are also spare disks in the shelves.

Test

To test this configuration, we switched off one disk shelf in the middle of both SAS chains:
multipathing degraded for all disks, every vdev degraded, and the pool continued to work, using the spare disks in
the remaining 5 online disk shelves.

This worked on the first of the two identical systems; on the second system a kernel panic occurred,
and the system was not able to write a dump (see picture).

When we repeated the disk shelf shutdown on the second system after a reboot, with the same disk shelf,
everything worked as expected.

Thank you for your support!


Files

panic_disk_shelf_shutdown_sued.png (123 KB) panic_disk_shelf_shutdown_sued.png Franz Schober, 2014-01-10 12:04 PM
OS-7920.tar.gz (7.95 KB) OS-7920.tar.gz scripts used to reproduce Mike Gerdts, 2019-12-20 05:30 PM

Related issues

Related to illumos gate - Bug #12076: reproducible panic in zfs:abd_free_linear (New)

Actions #1

Updated by Garrett D'Amore over 9 years ago

The backtrace is not for stock illumos.

However, my first guess is that this is coming about as a result of strategy() returning non-zero. That shouldn't be possible with stock illumos using the sd driver. (I've just finished tracing through the code paths on this myself.)

Perhaps SmartOS has a modified version of ZFS (it must!), or perhaps you have some other vdev present in the system besides sd? (You're not layering ZFS on top of itself with volumes, are you?)

Actions #2

Updated by Franz Schober over 9 years ago

On the disk pool consisting of disks connected by dual-path SAS there are ZVOLs, exported as COMSTAR LUNs via fibre channel (emlxs in target mode).
These LUNs are imported via fcp as devices (emlxs in initiator mode), and a zpool mirror is built from one LUN from the locally exported target and one from the remote system.
So there are two layers of ZFS in the system. All disk devices (SAS or FC) should be presented by sd.

zpool list -v
norddisk                                97,9T  2,85T  95,0T         -     2%  1.00x  ONLINE  -
  raidz2                                10,9T   325G  10,6T         -
    c0t5000C500575082EFd0                   -      -      -         -
    c0t5000C50057690CFFd0                   -      -      -         -
    c0t5000C500574FA357d0                   -      -      -         -
    c0t5000C5005745196Fd0                   -      -      -         -
    c0t5000C5005750511Fd0                   -      -      -         -
    c0t5000C50057508AF7d0                   -      -      -         -
  raidz2                                10,9T   324G  10,6T         -
    c0t5000C50057500113d0                   -      -      -         -
    c0t5000C500574FD207d0                   -      -      -         -
    c0t5000C50057502437d0                   -      -      -         -
    c0t5000C500576AD97Bd0                   -      -      -         -
    c0t5000C500576B062Bd0                   -      -      -         -
    c0t5000C500576AF7A3d0                   -      -      -         -
  raidz2                                10,9T   324G  10,6T         -
    c0t5000C50057507E87d0                   -      -      -         -
    c0t5000C500574FFE7Fd0                   -      -      -         -
    c0t5000C500576B0967d0                   -      -      -         -
    c0t5000C5005743D637d0                   -      -      -         -
    c0t5000C500575075D3d0                   -      -      -         -
    c0t5000C5005750259Bd0                   -      -      -         -
  raidz2                                10,9T   324G  10,6T         -
    c0t5000C50057505FBBd0                   -      -      -         -
    c0t5000C500576B06B3d0                   -      -      -         -
    c0t5000C500576ADB2Bd0                   -      -      -         -
    c0t5000C50057508F8Bd0                   -      -      -         -
    c0t5000C5005750752Fd0                   -      -      -         -
    c0t5000C500576B049Bd0                   -      -      -         -
  raidz2                                10,9T   325G  10,6T         -
    c0t5000C500576AFFE3d0                   -      -      -         -
    c0t5000C5005743CD8Fd0                   -      -      -         -
    c0t5000C500574FFA8Fd0                   -      -      -         -
    c0t5000C50057435CCBd0                   -      -      -         -
    c0t5000C500576AF25Fd0                   -      -      -         -
    c0t5000C50057435867d0                   -      -      -         -
  raidz2                                10,9T   324G  10,6T         -
    c0t5000C50057504723d0                   -      -      -         -
    c0t5000C50057412A77d0                   -      -      -         -
    c0t5000C500576B0733d0                   -      -      -         -
    c0t5000C5005750398Fd0                   -      -      -         -
    c0t5000C500574FCC2Bd0                   -      -      -         -
    c0t5000C500576B044Fd0                   -      -      -         -
  raidz2                                10,9T   323G  10,6T         -
    c0t5000C500575070B3d0                   -      -      -         -
    c0t5000C500576B0BC3d0                   -      -      -         -
    c0t5000C50057500043d0                   -      -      -         -
    c0t5000C500575051F7d0                   -      -      -         -
    c0t5000C500576B0357d0                   -      -      -         -
    c0t5000C5005742392Bd0                   -      -      -         -
  raidz2                                10,9T   325G  10,6T         -
    c0t5000C500574643C7d0                   -      -      -         -
    c0t5000C50057225793d0                   -      -      -         -
    c0t5000C50057423ED3d0                   -      -      -         -
    c0t5000C50057504D6Fd0                   -      -      -         -
    c0t5000C50057463553d0                   -      -      -         -
    c0t5000C50057689927d0                   -      -      -         -
  raidz2                                10,9T   324G  10,6T         -
    c0t5000C50057503963d0                   -      -      -         -
    c0t5000C500576B04C3d0                   -      -      -         -
    c0t5000C500576B010Bd0                   -      -      -         -
    c0t5000C500574FBF1Bd0                   -      -      -         -
    c0t5000C50057464923d0                   -      -      -         -
    c0t5000C500576B0F6Bd0                   -      -      -         -
  mirror                                7,44G  9,39M  7,43G         -
    c0t5000A72030089FCAd0                   -      -      -         -
    c0t5000A7203008BDFEd0                   -      -      -         -
nordp                                   31,2T   110G  31,1T         -     0%  1.00x  ONLINE  -
  mirror                                31,2T   110G  31,1T         -
    c0t600144F09082C029000052CC68420010d0      -      -      -         -
    c0t600144F09082C021000052CC68740004d0      -      -      -         -

stmfadm list-lu -v
LU Name: 600144F09082C029000052CC68420010
    Operational Status: Online
    Provider Name     : sbd
    Alias             : /dev/zvol/rdsk/norddisk/nord1_nordp_vol1
    View Entry Count  : 1
    Data File         : /dev/zvol/rdsk/norddisk/nord1_nordp_vol1
    Meta File         : not set
    Size              : 34634616274944
    Block Size        : 512
    Management URL    : not set
    Vendor ID         : SUN
    Product ID        : COMSTAR
    Serial Num        : not set
    Write Protect     : Disabled
    Writeback Cache   : Enabled
    Access State      : Active
LU Name: 600144F09082C029000052CC75B60011
    Operational Status: Online
    Provider Name     : sbd
    Alias             : /dev/zvol/rdsk/norddisk/nord1_suedp_vol1
    View Entry Count  : 1
    Data File         : /dev/zvol/rdsk/norddisk/nord1_suedp_vol1
    Meta File         : not set
    Size              : 34634616274944
    Block Size        : 512
    Management URL    : not set
    Vendor ID         : SUN
    Product ID        : COMSTAR
    Serial Num        : not set
    Write Protect     : Disabled
    Writeback Cache   : Enabled
    Access State      : Active

The ldi_strategy function in illumos-joyent driver_lyr.c looks just the same to me as in illumos-gate:

int
ldi_strategy(ldi_handle_t lh, struct buf *bp)
{
    struct ldi_handle    *handlep = (struct ldi_handle *)lh;
    dev_t            dev;

    if ((lh == NULL) || (bp == NULL))
        return (EINVAL);

    /* this entry point is only supported for cb devices */
    dev = handlep->lh_vp->v_rdev;
    if (!(handlep->lh_type & LH_CBDEV))
        return (ENOTSUP);

    bp->b_edev = dev;
    bp->b_dev = cmpdev(dev);
    return (bdev_strategy(bp));
}

Actions #3

Updated by Dan McDonald over 8 years ago

This has also been discovered by an OmniOS customer. Same stack trace:

> $c
vpanic()
0xfffffffffba88618()
vdev_disk_io_start+0x138(ffffff22600c6b08)
zio_vdev_io_start+0xaa(ffffff22600c6b08)
zio_execute+0x88(ffffff22600c6b08)
zio_nowait+0x21(ffffff22600c6b08)
vdev_probe+0x123(ffffff222b004000, ffffff2262622448)
zio_vdev_io_done+0xd6(ffffff2262622448)
zio_execute+0x88(ffffff2262622448)
taskq_thread+0x2d0(ffffff222ae21280)
thread_start+8()
> ::status
debugging crash dump vmcore.0 (64-bit) from storage3
operating system: 5.11 omnios-8c08411 (i86pc)
image uuid: b511ecb6-3b42-60b9-a297-bb6e99304bc6
panic message: 
assertion failed: ldi_strategy(dvd->vd_lh, bp) == 0, file: ../../common/fs/zfs/vdev_disk.c, line: 816
dump content: kernel pages only
> 

And OmniOS does have some of the SmartOS ZFS-for-zones enhancements.

Our customer was NOT using zones, however. They were using the global zone only and serving out zvols via iSCSI.

Our customer's pools did have some faulty or removed drives, though:

> ::spa -v
ADDR                 STATE NAME                                                
ffffff21e93a9000    ACTIVE rpool

    ADDR             STATE     AUX          DESCRIPTION                        
    ffffff21e9392a80 HEALTHY   -            root
    ffffff21e9392540 HEALTHY   -              mirror
    ffffff21e9392000 HEALTHY   -                /dev/dsk/c7t0d0s0
    ffffff21e938fac0 HEALTHY   -                /dev/dsk/c7t1d0s0
ffffff21e9842000    ACTIVE vol-0003ff20-5ed5-4a79-83c3-3e0940e74b31

    ffffff222acf0000 DEGRADED  -            root
    ffffff222b004a80 HEALTHY   -              raidz
    ffffff222b03ea80 HEALTHY   -                /dev/dsk/c4t500A0751037984FFd0s0
    ffffff222b001040 HEALTHY   -                /dev/dsk/c4t500A0751037985ACd0s0
    ffffff222b001580 HEALTHY   -                /dev/dsk/c4t500A0751037985B1d0s0
    ffffff222b001ac0 HEALTHY   -                /dev/dsk/c4t500A0751037985B5d0s0
    ffffff222b004000 HEALTHY   -                /dev/dsk/c4t500A0751037985B7d0s0
    ffffff222b004540 HEALTHY   -                /dev/dsk/c4t500A07510379854Bd0s0
    ffffff222b00d040 DEGRADED  -              raidz
    ffffff222b007040 HEALTHY   -                /dev/dsk/c4t500A07510379853Cd0s0
    ffffff222b007580 HEALTHY   -                /dev/dsk/c4t500A07510379856Cd0s0
    ffffff222b007ac0 HEALTHY   -                /dev/dsk/c4t500A07510379850Cd0s0
    ffffff222b00a000 HEALTHY   -                /dev/dsk/c4t500A07510379856Dd0s0
    ffffff222b00a540 REMOVED   -                /dev/dsk/c4t500A07510379854Dd0s0
    ffffff222b00aa80 HEALTHY   -                /dev/dsk/c4t500A07510379859Dd0s0
    ffffff222b015580 HEALTHY   -              raidz
    ffffff222b00d580 HEALTHY   -                /dev/dsk/c4t500A07510379855Ed0s0
    ffffff222b00dac0 HEALTHY   -                /dev/dsk/c4t500A075103798501d0s0
    ffffff222b012000 HEALTHY   -                /dev/dsk/c4t500A075103798502d0s0
    ffffff222b012540 HEALTHY   -                /dev/dsk/c4t500A075103798507d0s0
    ffffff222b012a80 HEALTHY   -                /dev/dsk/c4t500A075103798509d0s0
    ffffff222b015040 HEALTHY   -                /dev/dsk/c4t500A075103798530d0s0
    ffffff222b01bac0 HEALTHY   -              raidz
    ffffff222b015ac0 HEALTHY   -                /dev/dsk/c4t500A075103798540d0s0
    ffffff222b018000 HEALTHY   -                /dev/dsk/c4t500A075103798567d0s0
    ffffff222b018540 HEALTHY   -                /dev/dsk/c4t500A075103798569d0s0
    ffffff222b018a80 HEALTHY   -                /dev/dsk/c4t500A075103798571d0s0
    ffffff222b01b040 HEALTHY   -                /dev/dsk/c4t500A075103798599d0s0
    ffffff222b01b580 HEALTHY   -                /dev/dsk/c4t500A075103798562d0s0
>

And finally, they reported not having seen this problem until r151010s (vs. r151010j), but there are no ZFS changes between the two.

A look at the "r151010" branch of illumos-omnios shows what is in that ZFS code.

I'm adding a Joyent engineer and a ZFS engineer to see if any of this looks familiar. My speculation (and it's only speculation) is that a faulty drive tickled a failure path that was assumed to never be hit.

Actions #4

Updated by Dan McDonald about 6 years ago

This bug popped up on the OmniOS mailing list today:

http://lists.omniti.com/pipermail/omnios-discuss/2017-March/008457.html

Actions #5

Updated by Mike Gerdts over 3 years ago

This bug is linked by https://smartos.org/bugview/OS-2743, which seems to be a dup of https://smartos.org/bugview/OS-7920. I think I have diagnosed the root cause and have a potential fix, described in OS-7920.

Actions #6

Updated by Mike Gerdts over 3 years ago

  • Status changed from New to In Progress
  • Assignee set to Mike Gerdts
  • Tags deleted (needs-triage)
Actions #7

Updated by Mike Gerdts over 3 years ago

  • Subject changed from Kernel panic on switching off SAS disk shelf, assertion failed: ldi_strategy to ldi notifications trigger vdev_disk_free() without spa_config_lock()
Actions #8

Updated by Mike Gerdts over 3 years ago

Dan McDonald wrote:

This bug popped up on the OmniOS mailing list today:

http://lists.omniti.com/pipermail/omnios-discuss/2017-March/008457.html

That link is busted, but this one seems to describe the situation quite completely. https://web.archive.org/web/20140618152126/http://comments.gmane.org/gmane.os.illumos.zfs/3905

Actions #9

Updated by Mike Gerdts over 3 years ago

  • Related to Bug #12076: reproducible panic in zfs:abd_free_linear added
Actions #10

Updated by Mike Gerdts over 3 years ago

  • % Done changed from 0 to 90
Actions #12

Updated by Jason King over 3 years ago

For testing, the zfs test suite was run. The only failures were known failures with existing issues filed on them.

In addition, to test the bug specifically, Mike performed the following testing using iSCSI:

Manual testing

I found the easiest way to trigger this code path was with iSCSI: offlining an LU at the target can trigger it.
In the setup described below, the target and initiator are on two different systems. While it may be possible to have them on the same system, that could lead to same-kernel zfs-on-zfs deadlocks.
The iSCSI target and initiator settings do not survive reboot on SmartOS. After a reboot, the svcadm, itadm, stmfadm, and iscsiadm setup commands will need to be repeated. After they are executed, any pool that had previously existed on the LUs (but was not destroyed) may be imported.

target setup
[root@cn2 (coal-1) ~]# svcadm enable -r iscsi/target
[root@cn2 (coal-1) ~]# itadm create-target
Target iqn.2010-08.org.illumos:02:55bbfe58-53aa-4921-bae5-eda0ea7e2f33 successfully created
[root@cn2 (coal-1) ~]# zfs create zones/zfstest
[root@cn2 (coal-1) ~]# zfs set compression=on zones/zfstest
[root@cn2 (coal-1) ~]# zfs create -V 1g zones/zfstest/lun0
[root@cn2 (coal-1) ~]# zfs create -V 1g zones/zfstest/lun1
[root@cn2 (coal-1) ~]# zfs create -V 1g zones/zfstest/lun2

[root@cn2 (coal-1) ~]# stmfadm create-lu /dev/zvol/rdsk/zones/zfstest/lun0
Logical unit created: 600144F00C9D700000005DF0EC0B0001
[root@cn2 (coal-1) ~]# stmfadm create-lu /dev/zvol/rdsk/zones/zfstest/lun1
Logical unit created: 600144F00C9D700000005DF0EC0D0002
[root@cn2 (coal-1) ~]# stmfadm create-lu /dev/zvol/rdsk/zones/zfstest/lun2
Logical unit created: 600144F00C9D700000005DF0EC0E0003

[root@cn2 (coal-1) ~]# stmfadm list-lu
LU Name: 600144F00C9D700000005DF0EC0B0001
LU Name: 600144F00C9D700000005DF0EC0D0002
LU Name: 600144F00C9D700000005DF0EC0E0003

[root@cn2 (coal-1) ~]# stmfadm online-lu 600144F00C9D700000005DF0EC0B0001
[root@cn2 (coal-1) ~]# stmfadm online-lu 600144F00C9D700000005DF0EC0D0002
[root@cn2 (coal-1) ~]# stmfadm online-lu 600144F00C9D700000005DF0EC0E0003

[root@cn2 (coal-1) ~]# stmfadm list-lu | awk '{print "stmfadm add-view", $NF}' | sh -ex
+ stmfadm add-view 600144F00C9D700000005DF0EC0B0001
+ stmfadm add-view 600144F00C9D700000005DF0EC0D0002
+ stmfadm add-view 600144F00C9D700000005DF0EC0E0003

[root@cn2 (coal-1) ~]# ifconfig -a
lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
        inet 127.0.0.1 netmask ff000000
e1000g0: flags=1001004843<UP,BROADCAST,RUNNING,MULTICAST,DHCP,IPv4,FIXEDMTU> mtu 1500 index 2
        inet 10.99.99.40 netmask ffffff00 broadcast 10.99.99.255
        ether 52:54:0:48:bb:8c
lo0: flags=2002000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv6,VIRTUAL> mtu 8252 index 1
        inet6 ::1/128

initiator setup
[root@cn3 (coal-1) ~]# svcadm enable -rs iscsi/initiator
[root@cn3 (coal-1) ~]# iscsiadm modify discovery -t enable
[root@cn3 (coal-1) ~]# iscsiadm add discovery-address 10.99.99.40

[root@cn3 (coal-1) ~]# devfsadm
Could not open /etc/saf/zsmon/_pmtab
[root@cn3 (coal-1) ~]# format
Searching for disks...done

AVAILABLE DISK SELECTIONS:
       0. c0t600144F00C9D700000005DF0EC0B0001d0 <SUN-COMSTAR-1.0 cyl 1022 alt 2 hd 64 sec 32>
          /scsi_vhci/disk@g600144f00c9d700000005df0ec0b0001
       1. c0t600144F00C9D700000005DF0EC0D0002d0 <SUN-COMSTAR-1.0 cyl 1022 alt 2 hd 64 sec 32>
          /scsi_vhci/disk@g600144f00c9d700000005df0ec0d0002
       2. c0t600144F00C9D700000005DF0EC0E0003d0 <SUN-COMSTAR-1.0 cyl 1022 alt 2 hd 64 sec 32>
          /scsi_vhci/disk@g600144f00c9d700000005df0ec0e0003
       3. c1d0 <QEMU HARDDISK=QM00001-QM00001-0001-40.00GB>
          /pci@0,0/pci-ide@1,1/ide@0/cmdk@0,0
       4. c2t0d0 <Virtio-Block Device-0000-1.00GB>
          /pci@0,0/pci1af4,2@9/blkdev@0,0
       5. c3t0d0 <Virtio-Block Device-0000-1.00GB>
          /pci@0,0/pci1af4,2@a/blkdev@0,0
Specify disk (enter its number): ^D

[root@cn3 (coal-1) ~]# zpool create iscsi \
   mirror c0t600144F00C9D700000005DF0EC0B0001d0 c0t600144F00C9D700000005DF0EC0D0002d0 \
   spare c0t600144F00C9D700000005DF0EC0E0003d0

Exercise on debug build
The initiator system is running the 20191211T012722Z debug build. I use four windows:
1. On the initiator CN, run ./watch-OS-7920 (the dtrace script from the attached OS-7920.tar.gz)
2. On the initiator CN, run while true; do dd if=/dev/zero of=/iscsi/file bs=1024k count=100; done
3. On the initiator CN, run other commands
4. On the target CN, disable and enable an LU
Start the watcher in window 1 and the dd loop in window 2. Wait for dd to report stats at least once.
In window 4 (target system), offline one of the LUs that makes up the mirror.
[root@cn2 (coal-1) ~]# stmfadm offline-lu 600144F00C9D700000005DF0EC0D0002

In window 1, notice that the changed code path was hit:
[root@cn3 (coal-1) /var/tmp/OS-7920]# ./watch-OS-7920
dtrace: script './watch-OS-7920' matched 3 probes

CPU     ID                    FUNCTION:NAME
  0  20147           ldi_ev_get_type:return 2019 Dec 11 13:28:41 LDI:EVENT:OFFLINE
              zfs`vdev_disk_off_notify+0x37
              genunix`ldi_invoke_notify+0x11a
              genunix`e_ddi_offline_notify+0x8b
              genunix`devi_detach_node+0x89
              genunix`ndi_devi_offline+0x13d
              genunix`i_mdi_pi_state_change+0x5d9
              genunix`mdi_pi_offline+0x3d
              iscsi`iscsi_lun_offline+0xa6
              iscsi`iscsi_lun_destroy+0x43
              iscsi`iscsi_sess_reportluns+0x3cc
              iscsi`iscsi_sess_enumeration+0xb6
              genunix`taskq_thread+0x315
              unix`thread_start+0xb

  3  20147           ldi_ev_get_type:return 2019 Dec 11 13:28:43 LDI:EVENT:OFFLINE
              zfs`vdev_disk_off_finalize+0x3a
              genunix`ldi_invoke_finalize+0x132
              genunix`e_ddi_offline_finalize+0x79
              genunix`devi_detach_node+0x1d7
              genunix`ndi_devi_offline+0x13d
              genunix`i_mdi_pi_state_change+0x5d9
              genunix`mdi_pi_offline+0x3d
              iscsi`iscsi_lun_offline+0xa6
              iscsi`iscsi_lun_destroy+0x43
              iscsi`iscsi_sess_reportluns+0x3cc
              iscsi`iscsi_sess_enumeration+0xb6
              genunix`taskq_thread+0x315
              unix`thread_start+0xb

In window 3, observe that the spare was added to the mirror.
[root@cn3 (coal-1) ~]# zpool status
  pool: iscsi
 state: DEGRADED
status: One or more devices has been removed by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: resilvered 3.15M in 0 days 00:00:04 with 0 errors on Wed Dec 11 13:28:52 2019
config:

        NAME                                         STATE     READ WRITE CKSUM
        iscsi                                        DEGRADED     0     0     0
          mirror-0                                   DEGRADED     0     0     0
            c0t600144F00C9D700000005DF0EC0B0001d0    ONLINE       0     0     0
            spare-1                                  DEGRADED     0     0     0
              c0t600144F00C9D700000005DF0EC0D0002d0  REMOVED      0     0     0
              c0t600144F00C9D700000005DF0EC0E0003d0  ONLINE       0     0     0
        spares
          c0t600144F00C9D700000005DF0EC0E0003d0      INUSE     currently in use

errors: No known data errors
...

In window 4, online the LU
[root@cn2 (coal-1) ~]# stmfadm online-lu 600144F00C9D700000005DF0EC0D0002

In window 3, observe that the pool returns to its initial configuration
[root@cn3 (coal-1) ~]# zpool status
  pool: iscsi
 state: ONLINE
  scan: resilvered 3.15M in 0 days 00:00:04 with 0 errors on Wed Dec 11 13:28:52 2019
config:

        NAME                                         STATE     READ WRITE CKSUM
        iscsi                                        ONLINE       0     0     0
          mirror-0                                   ONLINE       0     0     0
            c0t600144F00C9D700000005DF0EC0B0001d0    ONLINE       0     0     0
            spare-1                                  ONLINE       0     0     0
              c0t600144F00C9D700000005DF0EC0D0002d0  ONLINE       0     0     0
              c0t600144F00C9D700000005DF0EC0E0003d0  ONLINE       0     0     0
        spares
          c0t600144F00C9D700000005DF0EC0E0003d0      INUSE     currently in use

errors: No known data errors
...

[root@cn3 (coal-1) ~]# zpool status
  pool: iscsi
 state: ONLINE
  scan: resilvered 333K in 0 days 00:00:01 with 0 errors on Wed Dec 11 13:29:17 2019
config:

        NAME                                       STATE     READ WRITE CKSUM
        iscsi                                      ONLINE       0     0     0
          mirror-0                                 ONLINE       0     0     0
            c0t600144F00C9D700000005DF0EC0B0001d0  ONLINE       0     0     0
            c0t600144F00C9D700000005DF0EC0D0002d0  ONLINE       0     0     0
        spares
          c0t600144F00C9D700000005DF0EC0E0003d0    AVAIL

errors: No known data errors
...

After doing this once manually, I put the offline/online into a loop:
[root@cn2 (coal-1) ~]# (set -ex; while true; do stmfadm offline-lu 600144F00C9D700000005DF0EC0D0002; sleep 15; stmfadm online-lu 600144F00C9D700000005DF0EC0D0002; sleep 45; done)
+ true
+ stmfadm offline-lu 600144F00C9D700000005DF0EC0D0002
+ sleep 15
+ stmfadm online-lu 600144F00C9D700000005DF0EC0D0002
+ sleep 45
+ true
+ stmfadm offline-lu 600144F00C9D700000005DF0EC0D0002
+ sleep 15
+ stmfadm online-lu 600144F00C9D700000005DF0EC0D0002
+ sleep 45
...

Also, I watched the pool cycle through its various stages of self-healing with:
[root@cn3 (coal-1) ~]# while true; do date; zpool status iscsi; sleep 10; done

Actions #13

Updated by Jason King over 3 years ago

From Joyent OS-7920:

There have been several occurrences of panics involving the following stack at about the same time as disk errors are logged. The first two arguments to ldi_ioctl vary across occurrences; the first is always a garbage pointer and the second is a valid DKIO ioctl.

ffffd003d5e7b990 ldi_ioctl+0x2e(5ca96e63, 425, ffffd003d5e7b9e8, 80000000, ffffd0c4fcf1ae18, 0)
ffffd003d5e7ba60 vdev_disk_open+0x4f1(ffffd0c578140680, ffffd003d5e7ba98, ffffd003d5e7ba90, ffffd003d5e7ba88)
ffffd003d5e7bad0 vdev_open+0xdd(ffffd0c578140680)
ffffd003d5e7bb00 vdev_reopen+0x3b(ffffd0c578140680)
ffffd003d5e7bb40 spa_async_probe+0x8a(ffffd0c571f6a000, ffffd0c578140680)
ffffd003d5e7bb80 spa_async_probe+0x56(ffffd0c571f6a000, ffffd0c578142640)
ffffd003d5e7bbc0 spa_async_probe+0x56(ffffd0c571f6a000, ffffd0c5766a7d40)
ffffd003d5e7bc20 spa_async_thread+0x281(ffffd0c571f6a000)
ffffd003d5e7bc30 thread_start+8()

There are three calls to ldi_ioctl in vdev_disk_open:

335         vdev_disk_alloc(vd);
336         dvd = vd->vdev_tsd;
...
568         if ((error = ldi_ioctl(dvd->vd_lh, DKIOCGMEDIAINFOEXT,
...
573         } else if ((error = ldi_ioctl(dvd->vd_lh, DKIOCGMEDIAINFO,
...
581         } else {
...
586         }
...
590         if (vd->vdev_wholedisk == 1) {
...
607                 (void) ldi_ioctl(dvd->vd_lh, DKIOCSETWCE, (intptr_t)&wce,
608                     FKIOCTL, kcred, NULL);
609         }

In this case, we are calling ldi_ioctl(<bad_address>, DKIOCSETWCE, ...), the call at line 607. In order to get here, we've made at least one other call to ldi_ioctl(dvd->vd_lh, ...) with a valid address in dvd->vd_lh. This appears to be a use-after-free, which is supported by:
> ffffd0c578140680::print vdev_t vdev_tsd
vdev_tsd = 0

The only place that vdev_tsd is set to NULL is in vdev_disk_free, which is called only by:
   1    172  usr/src/uts/common/fs/zfs/vdev_disk.c <<vdev_disk_off_finalize>>
             vdev_disk_free(vd);
   2    325  usr/src/uts/common/fs/zfs/vdev_disk.c <<vdev_disk_open>>
             vdev_disk_free(vd);
   3    653  usr/src/uts/common/fs/zfs/vdev_disk.c <<vdev_disk_close>>
             vdev_disk_free(vd);

vdev_disk_off_finalize is only called as an ldi callback. vdev_disk_open is not suspect, as a race involving itself would be blocked by the spa_config_lock taken higher in the stack. vdev_disk_close may be called via a vdev_disk_ops callback or via an ldi callback. The ldi callbacks are quite suspicious.
184 static ldi_ev_callback_t vdev_disk_off_callb = {
185         .cb_vers = LDI_EV_CB_VERS,
186         .cb_notify = vdev_disk_off_notify,
187         .cb_finalize = vdev_disk_off_finalize
188 };

Neither of those callbacks calls spa_config_enter(), allowing their manipulation of the vdev_disk to race with vdev_disk_open().
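
To make the race concrete, here is a minimal sketch of how the notify callback could serialize against vdev_disk_open() by entering the spa config lock before tearing down the vdev_disk state. The callback signature and the vdev/spa/vdev_disk names match vdev_disk.c; the specific locking shown (SCL_STATE_ALL as writer) is only an assumption for illustration and is not the committed fix.

/*
 * Illustrative sketch only: serialize the LDI offline notification
 * against vdev_disk_open() by taking the spa config lock before
 * touching vd->vdev_tsd.  SCL_STATE_ALL/RW_WRITER is an assumption
 * made for illustration, not the actual change that integrated.
 */
static int
vdev_disk_off_notify(ldi_handle_t lh, ldi_ev_cookie_t ecookie, void *arg,
    void *ev_data)
{
    vdev_t *vd = arg;
    spa_t *spa = vd->vdev_spa;
    vdev_disk_t *dvd;

    /* We only care about the offline event. */
    if (strcmp(ldi_ev_get_type(ecookie), LDI_EV_OFFLINE) != 0)
        return (LDI_EV_SUCCESS);

    /* Keep vdev_disk_open()/vdev_disk_io_start() out while we close. */
    spa_config_enter(spa, SCL_STATE_ALL, vd, RW_WRITER);
    dvd = vd->vdev_tsd;
    if (dvd != NULL) {
        dvd->vd_ldi_offline = B_TRUE;
        vdev_disk_close(vd);
    }
    spa_config_exit(spa, SCL_STATE_ALL, vd);

    return (LDI_EV_SUCCESS);
}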

Actions #14

Updated by Jason King over 3 years ago

Also:

Another one, however, I can confidently say is caused by this.

> ::status
debugging crash dump vmcore.7 (64-bit) from omni-2
operating system: 5.11 omni-physio-0-g59a9ce0e70 (i86pc)
build version: gfx-drm - heads/master-0-gbdc58b1-dirty

image uuid: b1d960bc-afbe-43c0-eea2-a975cb6312b2
panic message: BAD TRAP: type=e (#pf Page fault) rp=ffffff0008a99710 addr=ffffff094ae64778
dump content: kernel pages only
> $C
ffffff0008a99820 bdev_strategy+0x26(ffffff029ee775c0)
ffffff0008a99850 ldi_strategy+0x47(ffffff03086d9600, ffffff029ee775c0)
ffffff0008a998c0 vdev_disk_io_start+0x254(ffffff02733b45b8)
ffffff0008a99920 zio_vdev_io_start+0xdf(ffffff02733b45b8)
ffffff0008a99960 zio_execute+0xf5(ffffff02733b45b8)
ffffff0008a99990 zio_nowait+0x87(ffffff02733b45b8)
ffffff0008a999d0 vdev_queue_io_done+0x92(ffffff030f504648)
ffffff0008a99a10 zio_vdev_io_done+0xc8(ffffff030f504648)
ffffff0008a99a50 zio_execute+0xf5(ffffff030f504648)
ffffff0008a99b00 taskq_thread+0x315(ffffff03086e0ea0)
ffffff0008a99b10 thread_start+0xb()
>

This was purportedly fixed in OS-7797, but my workspace has that fix; it seems to have only tightened the race. Here we can see that the OS-7797 fix is present (a failure of ldi_strategy() now calls zio_interrupt()), yet we still ran into the disk being offline and vd_lh being NULL.
> vdev_disk_io_start+0x254::dis
...
vdev_disk_io_start+0x24f:       call   +0x442223c       <ldi_strategy>
vdev_disk_io_start+0x254:       testl  %eax,%eax
vdev_disk_io_start+0x256:       je     -0xea    <vdev_disk_io_start+0x172>
vdev_disk_io_start+0x25c:       jmp    -0x101   <vdev_disk_io_start+0x160>
...
> vdev_disk_io_start+0x160::dis
...
vdev_disk_io_start+0x160:       movl   $0x6,0x31c(%rbx)
vdev_disk_io_start+0x16a:       movq   %rbx,%rdi
vdev_disk_io_start+0x16d:       call   -0x14cb2 <zio_interrupt>
...
> ffffff02733b45b8::print zio_t io_vd | ::print vdev_t vdev_tsd | ::print vdev_disk_t
{
    vd_devid = 0
    vd_minor = 0
    vd_lh = 0
    vd_ldi_cbs = {
        list_size = 0x18
        list_offset = 0
        list_head = {
            list_next = 0xffffff026e2ad008
            list_prev = 0xffffff18277fc4b8
        }
    }
    vd_ldi_offline = 0x1 (B_TRUE)
}

That is:
 893 static void
 894 vdev_disk_io_start(zio_t *zio)
 895 {
 896         vdev_t *vd = zio->io_vd;
 897         vdev_disk_t *dvd = vd->vdev_tsd;
 898         unsigned long trim_flags = 0;
 899         vdev_buf_t *vb;
 900         struct dk_callback *dkc;
 901         buf_t *bp;
 902         int error;
 903
 904         /*
 905          * If the vdev is closed, it's likely in the REMOVED or FAULTED state.
 906          * Nothing to be done here but return failure.
 907          */
 908         if (dvd == NULL || (dvd->vd_ldi_offline && dvd->vd_lh == NULL)) {
 909                 zio->io_error = ENXIO;
 910                 zio_interrupt(zio);
 911                 return;
 912         }

// KAPOW!  vdev_disk_off_notify() struck from another thread
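
To make that window easier to see, here is a simplified user-space analogue of the check-then-use pattern above (plain C with pthreads, not illumos code; every name in it is made up for illustration). The point it demonstrates is that the NULL/offline test at line 908 only helps if the check and the subsequent use happen under the same lock that the freeing path takes, which is what holding spa_config_lock across the LDI callbacks would provide.

/*
 * Simplified analogue of the race: an "I/O" thread checks a pointer and
 * then uses it, while a "notify" thread frees and reallocates it.  With
 * the lock held across both the check and the use, the window is closed.
 * Build with: cc -std=c99 -pthread race.c
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct fake_vdev_disk {
    long vd_lh;                         /* stands in for the LDI handle */
} fake_vdev_disk_t;

static fake_vdev_disk_t *tsd;           /* stands in for vd->vdev_tsd */
static pthread_mutex_t config_lock = PTHREAD_MUTEX_INITIALIZER;
static volatile long sink;

/* "I/O" path: check-then-use, analogous to vdev_disk_io_start(). */
static void *
io_thread(void *arg)
{
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&config_lock);
        fake_vdev_disk_t *dvd = tsd;
        if (dvd != NULL)
            sink += dvd->vd_lh;         /* safe only while locked */
        pthread_mutex_unlock(&config_lock);
    }
    return (NULL);
}

/* "Notify" path: tears down and rebuilds the state, like the LDI callbacks. */
static void *
notify_thread(void *arg)
{
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&config_lock);
        free(tsd);
        tsd = NULL;                     /* "vdev_disk_free()" */
        pthread_mutex_unlock(&config_lock);

        pthread_mutex_lock(&config_lock);
        tsd = calloc(1, sizeof (*tsd)); /* "vdev_disk_open()" */
        pthread_mutex_unlock(&config_lock);
    }
    return (NULL);
}

int
main(void)
{
    pthread_t io, notify;

    tsd = calloc(1, sizeof (*tsd));
    pthread_create(&io, NULL, io_thread, NULL);
    pthread_create(&notify, NULL, notify_thread, NULL);
    pthread_join(io, NULL);
    pthread_join(notify, NULL);
    printf("done; no use-after-free because check and use stay locked\n");
    return (0);
}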

Actions #15

Updated by Electric Monk over 3 years ago

  • Status changed from In Progress to Closed
  • % Done changed from 90 to 100

git commit 1b500975aaacf8b5d0e18c9a117bf5560069ffc3

commit  1b500975aaacf8b5d0e18c9a117bf5560069ffc3
Author: Mike Gerdts <mike.gerdts@joyent.com>
Date:   2020-01-31T17:43:17.000Z

    4454 ldi notifications trigger vdev_disk_free() without spa_config_lock()
    Portions contributed by: Jason King <jason.king@joyent.com>
    Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
    Approved by: Dan McDonald <danmcd@joyent.com>
