Bug #4051

Don't free mdi_pathinfo_t in mptsas when device(s) are retired.

Added by Thibault VINCENT about 6 years ago. Updated about 4 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
kernel
Start date:
2013-08-17
Due date:
% Done:

100%

Estimated time:
Difficulty:
Medium
Tags:
needs-triage

Description

The crash happens when MPXIO is enabled and any ioctl reaches mpapi, but it is closely tied to the hardware, especially the SAS topology.

Platform
  • OmniOS stable or bloody
  • Dell R720xd server
  • 2 x LSI 9207-8e HBAs = 4 x 6Gb ports
  • 2 x Supermicro SC847E26-RJBOD1 JBODs (each has multiple LSI expanders)
  • Up to 90 Seagate ST4000NM0023 drives and STEC s840

I doubt this can easily be reproduced with other hardware, but during tests I've seen it get worse when a Dell H310 or H710 controller (LSI 2008 based, mr_sas) is involved for the system disks. It also did not happen on a Dell R510 (PCIe 2 slots, but the same HBA).

Steps
  1. A single enclosure with 45 drives is connected to the first HBA on a single port
  2. OmniOS bloody is installed on the first drive
  3. Server boots correctly on the first enclosure drive
  4. MPXIO is enabled with "stmsboot -D mpt_sas -e" and the server is rebooted
  5. The system is stable; "mpathadm list lu" reports a single path to all drives
  6. The second enclosure cable can be plugged into any remaining HBA port, unplugged, etc.; it's rock solid and "mpathadm list lu" is consistent with reality
  7. A second empty enclosure is added, system is still stable
  8. A first disk is inserted into the second enclosure; the system is stable and the drive appears in "list lu"
  9. A second drive is inserted into that enclosure; it then crashes when mpathadm issues an ioctl

This is just an example; the bottom line is that I can't have two enclosures in the topology when MPXIO is enabled. The crash happens regardless of the number of paths, even with only a few drives, and even when the enclosures are daisy-chained one to the other (single port server-side).

Also, as said earlier, having an internal Dell H310/H710 controller and the associated backplane with an expander made things worse: the system could crash with only one external enclosure on the LSI cards. But the R510 server also had this and could handle everything.

Dump
ftp://ftp.zeroloop.net/bugs/illumos/crash.16.xz
ftp://ftp.zeroloop.net/bugs/illumos/unix.16.xz
ftp://ftp.zeroloop.net/bugs/illumos/vmdump.16.xz
ftp://ftp.zeroloop.net/bugs/illumos/vmcore.16.xz

Usual trace (may not match above dump)

Aug 14 08:21:34 san-1 unix: [ID 836849 kern.notice] 
Aug 14 08:21:34 san-1 ^Mpanic[cpu3]/thread=ffffff430fd83060: 
Aug 14 08:21:34 san-1 genunix: [ID 335743 kern.notice] BAD TRAP: type=e (#pf Page fault) rp=ffffff01e9ee09a0 addr=2c occurred in module "genunix" due to a NULL pointer dereference
Aug 14 08:21:34 san-1 unix: [ID 100000 kern.notice] 
Aug 14 08:21:34 san-1 unix: [ID 839527 kern.notice] mpathadm: 
Aug 14 08:21:34 san-1 unix: [ID 753105 kern.notice] #pf Page fault
Aug 14 08:21:34 san-1 unix: [ID 532287 kern.notice] Bad kernel fault at addr=0x2c
Aug 14 08:21:34 san-1 unix: [ID 243837 kern.notice] pid=9495, pc=0xfffffffffbaecb98, sp=0xffffff01e9ee0a90, eflags=0x10286
Aug 14 08:21:34 san-1 unix: [ID 211416 kern.notice] cr0: 80050033<pg,wp,ne,et,mp,pe> cr4: 426f8<osxsav,vmxe,xmme,fxsr,pge,mce,pae,pse,de>
Aug 14 08:21:34 san-1 unix: [ID 624947 kern.notice] cr2: 2c
Aug 14 08:21:34 san-1 unix: [ID 625075 kern.notice] cr3: 40309c6000
Aug 14 08:21:34 san-1 unix: [ID 625715 kern.notice] cr8: 0
Aug 14 08:21:34 san-1 unix: [ID 100000 kern.notice] 
Aug 14 08:21:34 san-1 unix: [ID 592667 kern.notice]     rdi:                0 rsi:                4 rdx: ffffff4351a28540
Aug 14 08:21:34 san-1 unix: [ID 592667 kern.notice]     rcx:              224  r8:           100005  r9: ffffff42e149b8a8
Aug 14 08:21:34 san-1 unix: [ID 592667 kern.notice]     rax: ffffff4351a283c0 rbx: ffffff430fa043c8 rbp: ffffff01e9ee0aa0
Aug 14 08:21:34 san-1 unix: [ID 592667 kern.notice]     r10: fffffffffb854144 r11:                0 r12: ffffff4352a32cc0
Aug 14 08:21:34 san-1 unix: [ID 592667 kern.notice]     r13: ffffff4362364500 r14: ffffff42e1466680 r15:               2d
Aug 14 08:21:34 san-1 unix: [ID 592667 kern.notice]     fsb:                0 gsb: ffffff430e1e0a80  ds:               4b
Aug 14 08:21:34 san-1 unix: [ID 592667 kern.notice]      es:               4b  fs:                0  gs:              1c3
Aug 14 08:21:34 san-1 unix: [ID 592667 kern.notice]     trp:                e err:                0 rip: fffffffffbaecb98
Aug 14 08:21:34 san-1 unix: [ID 592667 kern.notice]      cs:               30 rfl:            10286 rsp: ffffff01e9ee0a90
Aug 14 08:21:34 san-1 unix: [ID 266532 kern.notice]      ss:               38
Aug 14 08:21:34 san-1 unix: [ID 100000 kern.notice] 
Aug 14 08:21:34 san-1 genunix: [ID 655072 kern.notice] ffffff01e9ee0880 unix:die+df ()
Aug 14 08:21:34 san-1 genunix: [ID 655072 kern.notice] ffffff01e9ee0990 unix:trap+db3 ()
Aug 14 08:21:34 san-1 genunix: [ID 655072 kern.notice] ffffff01e9ee09a0 unix:cmntrap+e6 ()
Aug 14 08:21:34 san-1 genunix: [ID 655072 kern.notice] ffffff01e9ee0aa0 genunix:ddi_get_instance+8 ()
Aug 14 08:21:34 san-1 genunix: [ID 655072 kern.notice] ffffff01e9ee0af0 scsi_vhci:vhci_mpapi_sync_lu_oid_list+5d ()
Aug 14 08:21:34 san-1 genunix: [ID 655072 kern.notice] ffffff01e9ee0b80 scsi_vhci:vhci_mpapi_ioctl+a8 ()
Aug 14 08:21:34 san-1 genunix: [ID 655072 kern.notice] ffffff01e9ee0c40 scsi_vhci:vhci_mpapi_ctl+e3 ()
Aug 14 08:21:34 san-1 genunix: [ID 655072 kern.notice] ffffff01e9ee0c80 scsi_vhci:vhci_ioctl+5d ()
Aug 14 08:21:34 san-1 genunix: [ID 655072 kern.notice] ffffff01e9ee0cc0 genunix:cdev_ioctl+39 ()
Aug 14 08:21:34 san-1 genunix: [ID 655072 kern.notice] ffffff01e9ee0d10 specfs:spec_ioctl+60 ()
Aug 14 08:21:34 san-1 genunix: [ID 655072 kern.notice] ffffff01e9ee0da0 genunix:fop_ioctl+55 ()
Aug 14 08:21:34 san-1 genunix: [ID 655072 kern.notice] ffffff01e9ee0ec0 genunix:ioctl+9b ()
Aug 14 08:21:34 san-1 genunix: [ID 655072 kern.notice] ffffff01e9ee0f10 unix:brand_sys_sysenter+1c9 ()
Aug 14 08:21:34 san-1 unix: [ID 100000 kern.notice] 
Aug 14 08:21:34 san-1 genunix: [ID 672855 kern.notice] syncing file systems...
Aug 14 08:21:35 san-1 genunix: [ID 904073 kern.notice]  done
Aug 14 08:21:34 san-1 genunix: [ID 111219 kern.notice] dumping to /dev/zvol/dsk/rpool/dump, offset 65536, content: kernel
Aug 14 08:22:20 san-1 genunix: [ID 100000 kern.notice] 
Aug 14 08:22:20 san-1 genunix: [ID 665016 kern.notice] ^M100% done: 2875835 pages dumped, 
Aug 14 08:22:20 san-1 genunix: [ID 851671 kern.notice] dump succeeded


Related issues

Related to illumos gate - Bug #4085: mpathadm list lu causes panic (Closed, 2013-08-28)


History

#1

Updated by Thibault VINCENT about 6 years ago

I've been doing lots of tests with other parts.

Now it is clear the crash is not related to:
  • Dell servers: an old HP server does the same
  • LSI 9207-8e or -8i HBAs (2308-based chips): 2008-based controllers also make OmniOS crash
  • A deep SAS topology: keeping a single layer of expanders doesn't help
None of the following helped:
  • using any 12th-generation Dell server we have: R320, R520, R720xd
  • using a previous-generation Dell R510: contrary to what I said earlier, it does crash; that test was incomplete
  • using an old HP server with the same LSI 9207-8e HBAs
  • using a Dell H200 IT mode controller (LSI 2008) both in R720xd's or the HP server
  • updating enclosure expanders firmware to v. 55.14.18.0
  • downgrading ST4000NM0023 drives to firmware 002
  • using SAS switches in between HBAs and enclosures (LSI SAS6160)
  • eliminating internal chaining in the enclosures, by plugging all expanders directly into the HBAs or SAS switch

What I can't do is use another brand/model of enclosure. Switching to other drives would also be complicated.

New crash dump with the R510 and a simpler setup (one HBA, one path to one SAS switch):
ftp://ftp.zeroloop.net/bugs/illumos/unix.28.xz
ftp://ftp.zeroloop.net/bugs/illumos/vmcore.28.xz
ftp://ftp.zeroloop.net/bugs/illumos/vmdump.28.xz

Is anyone able to troubleshoot with these files? Would you need remote SSH + KVM access?
As far as I can tell, in all situations it's always the same bug, where a lun struct gets a NULL vlun member.

Thanks

#2

Updated by Thibault VINCENT about 6 years ago

Since I got the switches I've been covering more cases, and here's an interesting one: with two enclosures (45 slots and 4 expanders each) but a single ST4000NM0023 drive, it crashes!

Doing all this with Oracle Solaris 11 works, and I could not reproduce the problem with Linux or Windows either. I mention this to confirm the hardware can perform well with other systems, and that this is not a misuse of hardware or a cabling problem.

#3

Updated by Frank Penrose almost 6 years ago

I wish to "+1" this bug report.

We have a Dell PowerEdge R710 server with 3 Dell MD1200 disk trays with 2TB SATA drives (12 drives per tray) and an LSI 9200-8e controller, plus the internal controller from Dell (I believe it is a Dell H310). We are experiencing the same problem described by Thibault VINCENT, with periodic freezes/lockups. We can trigger a crash with "mpathadm list lu".

In addition, we see weird behavior with two Toshiba 2TB disks in the system, while the other 34 Seagate disks respond normally. Here is an excerpt from diskmap.py:

0:07:05 c10t5000C5001049C12Bd0 ST32000444SS 2.0T Ready (RDY) engr-std3: raidz1-5
0:07:06 c11t5000039428C8E13Bd0 MK2001TRKB 2.0T Ready (RDY) engr-std3: raidz1-6
0:07:06 c11t5000039428C8E13Bd0 MK2001TRKB 2.0T Ready (RDY) engr-std3: raidz1-6
0:07:07 c10t5000C500104991DFd0 ST32000444SS 2.0T Ready (RDY) engr-std3: raidz1-6

Notice the two Toshiba disks (MK2001TRKB) show up in the same position in the MD1200 (enclosure 7, slot 6).

Here is an excerpt of format:

6. c9t5000039428C8E13Ad0 <TOSHIBA-MK2001TRKB-DCA8-1.82TB>
/pci@0,0/pci8086,340e@7/pci1000,30b0@0/iport@f/disk@w5000039428c8e13a,0
41. c11t5000039428C8E13Bd0 <TOSHIBA-MK2001TRKB-DCA8-1.82TB>
/pci@0,0/pci8086,340e@7/pci1000,30b0@0/iport@f0/disk@w5000039428c8e13b,0

that shows the two Toshiba disks. We also continually get reports of

genunix: [ID 408114 kern.info] /scsi_vhci/disk@g5000c50010468ba7 (sd58) offline

for a disk that, as far as I can tell, never existed in the system (we did have a g5000c50010468ba6 in the system, but after we removed and detached it, it did not come back properly).

We're going to reinstall and try oi_151a8 but some progress on this bug would be appreciated. We will attach dumps once we have them.

#4

Updated by Thibault VINCENT almost 6 years ago

Hi Frank,
I hope you'll get some attention with this bug; it was a showstopper for us and the reason we moved away from illumos-based distros.
Nevertheless, if progress is made, and depending on how quickly it comes, I may be able to confirm a fix on the initial Supermicro setup. It will be available as a lab for about two more weeks, then will be reduced to one JBOD and only a few disks. Remote access is still an option for a developer to test kernels.

#5

Updated by Chip Schweiss almost 6 years ago

I can confirm this is a problem when using Supermicro JBODs, in particular the 847E26-RJBOD1.

The problem does not occur on my systems with DataON JBODs with the same OmniOS version.

I have this problem on a DR system that has many hours a day during which I can run tests. If any developer would like to use this resource for diagnosis/testing, please contact me.

#6

Updated by Frank Penrose almost 6 years ago

  • Status changed from New to Feedback

Sorry for not updating sooner. We actually resolved our issue with multipathing in OpenIndiana. We were having a problem with one of the drive types (Toshiba) being different from all the rest (Seagate). Below are the notes from our administrator who fixed the issue:

Multipath issues
Issues with multipathing were by far the largest problem with this machine. STOR3 is the most unique of our storage arrays because it is connected to disk chassis (Dell MD 1200's) that host multiple controllers. This setup increases the redundancy of the array as a cable or controller could fail and not bring the storage down. Unfortunately it also increases the complexity, especially when using OpenIndiana.

Before the fix, running the command "mpathadm list lu" would always crash the server. In order to help debug the problem the following changes were made to the system config file.

Changes to /etc/system

* Added 12/5/2013 - fep, to help with crash dumps hopefully.
* Don't use multi-threaded fast crash dump or a high compression level
*
set dump_plat_mincpu=0
set dump_bzip2_level=1

Those changes allowed us to view the dump files created by the crash. The information they contained led us to two issues:

https://www.illumos.org/issues/644
https://www.illumos.org/issues/4051

Using the information from those, the scsi_vhci.conf file was edited as follows.

Changes to scsi_vhci.conf (the Toshiba lines were most important for us)

scsi-vhci-failover-override =
"3PARdataVV", "f_sym",
"COMPELNTCompellent Vol", "f_sym",
"DGC VRAID", "f_asym_emc",
"HP HSV2", "f_sym",
"HP MB2000FAMYV", "f_sym", # HP, rebadged WD RE
"HP EG0600FBDSR", "f_sym", # HP, rebadged WD S25
"HP OPEN", "f_sym",
"NETAPP LUN", "f_sym",
"HITACHI DF600F", "f_sym",
"HITACHI HUS156060VLS600", "f_sym",
"HITACHI HUS723020ALS", "f_sym", # Hitachi Ultrastar 7K3000 SAS HDD
"OCZ TALOS", "f_sym", # OCZ Talos SSD
"Pliant LS150S", "f_sym", # Pliant SAS SSD
"Pliant LS300S", "f_sym", # Pliant SAS SSD
"STEC Z4", "f_sym", # ZeusIOPS
"STEC ZeusRAM", "f_sym", # ZeusRAM
"STEC Z16", "f_sym", # ZeusIOPS
"TOSHIBA MK1001TRKB", "f_sym", # Toshiba SAS HDD
"TOSHIBA MK2001TRKB", "f_sym", # Toshiba SAS HDD
"TOSHIBA MK1001GRZB", "f_sym", # Toshiba SAS SSD 100Gb
"TOSHIBA MK2001GRZB", "f_sym", # Toshiba SAS SSD 200Gb
"TOSHIBA MK4001GRZB", "f_sym", # Toshiba SAS SSD 400Gb
"WD WD4001FYYG", "f_sym"; # WD RE 4TB SAS HDD

After this change, stmsboot -e was run and the system appeared stable. It is possible that stmsboot -e had never been run before, which may have been part of our issue.

I then edited and unedited this file several times to try to recreate the crash, but the server would not fail. After each edit of scsi_vhci.conf I ran stmsboot -e and -d, then:
  • checked whether disks disappear and reappear
  • checked whether zpool status changes
  • checked whether disk paths change
  • tested mpathadm
I also cleaned up /dev/dsk with devfsadm.

Important lessons learned!

  • Specific disks handle multipathing differently on OpenIndiana; be sure to test them before adding them to a production environment.
  • If multipathing is not working properly, diskmap.py will list the affected disk twice in the same location.
  • If multipathing is not working properly, the disk name in zpool status will look different from the other, properly working disks. This may cause problems with the mpathadm tool.
  • ZFS handles multipathing changes well. I was able to enable and disable multipathing, and ZFS would automatically update the drive locations in each zpool to avoid issues.

#7

Updated by Dan McDonald over 4 years ago

It is possible this is a duplicate of #4085, and of this fix in illumos-nexenta:

commit 7cf584fc964e0765e7e5caa3db56c13d249741ca
Author: Jeffry Molanus <>
Date: Tue May 13 05:49:12 2014 +0200

OS-253 we should not free mdi_pathinfo_t in mptsas when device(s) are retired

If anyone can reproduce this issue on OmniOS, please contact me.

#8

Updated by Dan McDonald over 4 years ago

Here's the analysis from #4085:

This has the same symptoms as an issue with retired devices with multiple paths that we have identified and fixed, with the following explanation:
For each path in a multipath configuration, we have a pathinfo_t. It's
created in MDI_PATHINFO_STATE_INIT state (a "transient" state). For
the rest of its lifetime it's supposed to transition between either
OFFLINE or ONLINE depending on whether it is available.

In MDI, each pathinfo_t has a "client", which is the target device's
devinfo node. In mpt_sas's case, mptsas_create_virt_lun allocates a
pathinfo_t, and then tries to online the path via mdi_pi_online.

mdi_pi_online calls i_mdi_pi_state_change requesting
MDI_PATHINFO_STATE_ONLINE. This invokes the pHCI and vHCI specific
callbacks to set up the path. If it has a client and the driver is not
attached, it tries to invoke deferred attach via ndi_devi_online.
However, if ndi_devi_online fails, i_mdi_pi_state_change resets the
state of the pathinfo_t and also returns an error.

This is wrong because:
1) We have already invoked the callbacks to bring the path online.
Changing the path state flag to e.g. MDI_PATHINFO_STATE_OFFLINE (or
worse, back to MDI_PATHINFO_STATE_INIT) does not reflect the actual
state from the perspective of the pHCI/vHCI drivers.
2) Whether the driver could attach to the client should not affect
the availability of the path.

If the client device is retired, ndi_devi_online will fail, thus
sending us into a spiral of suffering.

#9

Updated by Dan McDonald over 4 years ago

  • Subject changed from crash in vhci_mpapi_sync_lu_oid_list() on ioctl, possible SAS problem to Don't free mdi_pathinfo_t in mptsas when device(s) are retired.

Subject change summary: the original subject was "crash in vhci_mpapi_sync_lu_oid_list() on ioctl, possible SAS problem".

Changed it to reflect the fix from illumos-nexenta.

#10

Updated by Electric Monk about 4 years ago

  • Status changed from Feedback to Closed
  • % Done changed from 0 to 100

git commit 11779b4caed6449da7f5cc8550aef185a2d06ba3

commit  11779b4caed6449da7f5cc8550aef185a2d06ba3
Author: Jeffry Molanus <jeffry.molanus@nexenta.com>
Date:   2015-07-28T18:10:00.000Z

    4051 Don't free mdi_pathinfo_t in mptsas when device(s) are retired.
    Reviewed by: Dan McDonald <danmcd@omniti.com>
    Reviewed by: Albert Lee <alee@omniti.com>
    Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
    Reviewed by: Garrett D'Amore <garrett@damore.org>
