Feature #11380

Support for Microsemi SmartPQI HBA

Added by Gordon Ross over 4 years ago. Updated 4 months ago.

Status: Closed
Priority: Normal
Assignee:
Category: driver - device drivers
Start date:
Due date:
% Done: 100%
Estimated time:
Difficulty: Medium
Tags:
Gerrit CR:
External Bug: racktop:BSR-13522

Description

Support for Microsemi SmartPQI HBA, found on some HP Enterprise hardware

Disk pull tests:
From Stuart Maybee:

We did run them on the version based on the initial merge of our driver with the RackTop version, but not on the latest one in the pull request. Unfortunately, the test rig I was using is not currently available to me, and I am deep in work on another driver, so I have not been able to put any time into smartpqi lately. If I can shake some time free I will see if I can set up a disk pull test, but it may not be very soon.


Related issues

Related to illumos gate - Bug #15979: kernel heap corruption detected during disk pulls on HPE appliance | Closed | Yuri Pankov

Actions #2

Updated by anil choudhary over 1 year ago

Hi Gordon Ross,
We are also seeing a kernel panic in our OpenIndiana deployment:
operating system: 5.11 illumos-7bb0fe31b7 (i86pc)
build version: heads/master-0-g7bb0fe31b7-dirty

image uuid: bafb9eb3-fa1f-6942-b438-a5b6ec2a1cca
panic message: I/O to pool 'rpool' appears to be hung.
dump content: kernel pages only
(curproc requested, but a kernel thread panicked)
vpanic()
vdev_deadman+0x108(fffffe235046a000)
vdev_deadman+0x43(fffffe235046d000)
spa_deadman+0xab(fffffe231ff62000)
cyclic_softint+0xe2(fffffffffbc4b000, 0)
cbe_low_level+0x16(0, 0)
av_dispatch_softvect+0x72(2)
apix_dispatch_softint+0x35(0, 0)
switch_sp_and_call+0x15()
apix_do_softint+0x60(fffffcc22bdc9f10)
apix_do_interrupt+0x298(fffffcc22bdc9f10, 1c3561670)
_interrupt+0xc3()

We found that FreeBSD has a fix for the same stack trace. Here is the link:

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=240145

Can we merge that fix from FreeBSD?

Thanks
Anil

Actions #3

Updated by Garrett D'Amore over 1 year ago

The driver posted here has numerous problems; many of them we (RackTop) have fixed.

There is a chance that I'll come back with a completely refactored and modernized driver soon.

Actions #4

Updated by Gordon Ross 8 months ago

Updated version of the "smartpqi" driver, (with work from Nexenta, Tintri, and RackTop) is here:
https://github.com/racktopsystems/illumos-gate/commits/smartpqi5

Note that there are twenty-something commits there to help keep straight what were all the changes the various contributors made. That should all be squashed and "portions contributed" lines added for the various authors.

If anyone has time to help test the code on this branch, please post your results in comments on this issue or on the developers mailing list. Thanks!

Actions #5

Updated by Gordon Ross 8 months ago

  • Assignee changed from Gordon Ross to Toomas Soome
  • External Bug set to racktop:BSR-13522

Actions #6

Updated by Electric Monk 8 months ago

  • Gerrit CR set to 2946

Actions #7

Updated by anil choudhary 8 months ago

We can test it if you provide an ISO; we have the hardware.

Thanks,
Anil

Actions #8

Updated by Gordon Ross 8 months ago

Re FreeBSD bug 240145, I'm not sure that change has been integrated. Someone needs to look it over and see:
https://cgit.freebsd.org/src/commit/sys/dev/smartpqi?h=stable/13&id=1569aab1cb38a38fb619f343ed1e47d4b4070ffe

Actions #9

Updated by anil choudhary 8 months ago

I have successfully built the smartpqi driver code on a test Vagrant machine
using the nightly script and the steps mentioned in the wiki.

How can I build an ISO?
root@openindiana:/code/illumos-gate# uname -a
SunOS openindiana 5.11 smartpqi5-0-g9b8bee96a3 i86pc i386 i86pc
root@openindiana:/code/illumos-gate#

Help would be appreciated, as I need an ISO to install on an HPE blade.

Thanks,
Anil

Actions #10

Updated by Toomas Soome 8 months ago

With OpenIndiana, this is done with the distribution constructor: pkg install distribution-constructor. You can find sample manifests in the /usr/share/distro_const tree. It may be faster and easier to transfer the driver binary manually and use add_drv.

Actions #11

Updated by anil choudhary 8 months ago

Hi Toomas Soome,
I followed the instructions and got the following error:
/usr/share/distro_const/DC-manifest.defval.xml validates
/tmp/slim_cd_x86_temp_101044.xml validates
Simple Log: /rpool/distro_const/dc-slim/logs/simple-log-2023-06-30-21-19-36
Detail Log: /rpool/distro_const/dc-slim/logs/detail-log-2023-06-30-21-19-36
Build started Fri Jun 30 21:19:36 2023
Distribution name: OpenIndiana_Live_X86
Build Area dataset: rpool/distro_const/dc-slim
Build Area mount point: /rpool/distro_const/dc-slim
==== im-pop: Image area creation
Traceback (most recent call last):
  File "/usr/share/distro_const/im_pop.py", line 34, in <module>
    from osol_install.libtransfer import TM_E_SUCCESS
ModuleNotFoundError: No module named 'osol_install.libtransfer'
Child returned err 1
Build completed Fri Jun 30 21:19:37 2023
Build failed.
root@openindiana:/code/illumos-gate/distro-const#

Actions #12

Updated by Toomas Soome 8 months ago

Apparently, libtransfer should be provided by the system/install package.

Actions #13

Updated by anil choudhary 8 months ago

Hi,
I have installed the patch on an HPE Gen8 blade.
Can you suggest test cases to run to verify the patch?

Actions #14

Updated by Toomas Soome 8 months ago

I'd start with the basics: create partitions, create a file system on them, and run some disk/file system benchmark.
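
As a starting point for the "basics" step, here is a minimal Python sketch of a write, sync, and read-back check. It is only an illustration, not a benchmark and not part of the driver work: the function name, file prefix, and sizes are made up for this example, and the target directory is assumed to be on a file system backed by the controller under test.

```python
import hashlib
import os
import tempfile
import time


def write_and_verify(directory, size_mib=16, chunk_mib=1):
    """Write random data to a temp file in `directory`, fsync it, read it
    back, and compare SHA-256 digests.

    Returns (ok, write_seconds, read_seconds).
    """
    chunk = os.urandom(chunk_mib * 1024 * 1024)
    written = hashlib.sha256()
    # "pqi-smoke-" is an arbitrary prefix chosen for this sketch.
    fd, path = tempfile.mkstemp(dir=directory, prefix="pqi-smoke-")
    try:
        start = time.monotonic()
        with os.fdopen(fd, "wb") as f:
            for _ in range(size_mib // chunk_mib):
                f.write(chunk)
                written.update(chunk)
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the data to stable storage
        write_seconds = time.monotonic() - start

        read_back = hashlib.sha256()
        start = time.monotonic()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(len(chunk)), b""):
                read_back.update(block)
        read_seconds = time.monotonic() - start

        ok = written.digest() == read_back.digest()
        return ok, write_seconds, read_seconds
    finally:
        os.remove(path)


if __name__ == "__main__":
    # Hypothetical target: replace with a directory on the file system under test.
    target = tempfile.gettempdir()
    ok, w, r = write_and_verify(target)
    print("verify:", "OK" if ok else "MISMATCH", f"write={w:.2f}s read={r:.2f}s")
```

Note that the read-back will often be served from memory cache rather than from the disks, so this mostly exercises the write path. A real benchmark tool and larger data sizes are still needed for meaningful load.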

Actions #15

Updated by Gordon Ross 8 months ago

We can report positive test results with an HPE box using a P408i card.
(The version of code presently up in the CR)

Actions #16

Updated by anil choudhary 8 months ago

Basic test cases were successful.
We will put it into production.

Actions #17

Updated by anil choudhary 7 months ago

After applying the patch in production, I am getting the following warnings in dmesg:

214882 214852   2   Jul 21 ?        6328:02 bin/sun86/webServer -p 1186 -m
~# dmesg |grep 214882
Jul 21 22:14:06 sockfs: [ID 431381 kern.warning] WARNING: bind with uninitialized sin6_scope_id (-2107784) on socket. Pid = 214882
Jul 21 22:14:06 sockfs: [ID 279960 kern.warning] WARNING: bind with uninitialized __sin6_src_id (-1025) on socket. Pid = 214882
Jul 21 22:14:06 ip: [ID 190396 kern.warning] WARNING: connect/send* with uninitialized sin6_scope_id (-2107784) on socket. Pid = 214882
Jul 21 22:14:06 sockfs: [ID 431381 kern.warning] WARNING: bind with uninitialized sin6_scope_id (-2107784) on socket. Pid = 214882
Jul 21 22:14:06 sockfs: [ID 279960 kern.warning] WARNING: bind with uninitialized __sin6_src_id (-1025) on socket. Pid = 214882
Jul 21 22:14:06 ip: [ID 190396 kern.warning] WARNING: connect/send* with uninitialized sin6_scope_id (-2107784) on socket. Pid = 214882
Jul 21 22:14:06 sockfs: [ID 431381 kern.warning] WARNING: bind with uninitialized sin6_scope_id (-2107784) on socket. Pid = 214882
Jul 21 22:14:06 sockfs: [ID 279960 kern.warning] WARNING: bind with uninitialized __sin6_src_id (-1025) on socket. Pid = 214882
Jul 21 22:14:06 ip: [ID 190396 kern.warning] WARNING: connect/send* with uninitialized sin6_scope_id (-2107784) on socket. Pid = 214882
Jul 21 22:14:06 sockfs: [ID 431381 kern.warning] WARNING: bind with uninitialized sin6_scope_id (-2107784) on socket. Pid = 214882
Jul 21 22:14:06 sockfs: [ID 279960 kern.warning] WARNING: bind with uninitialized __sin6_src_id (-1025) on socket. Pid = 214882
Jul 21 22:14:06 ip: [ID 190396 kern.warning] WARNING: connect/send* with uninitialized sin6_scope_id (-2107784) on socket. Pid = 214882
Jul 21 22:14:06 sockfs: [ID 431381 kern.warning] WARNING: bind with uninitialized sin6_scope_id (-2107784) on socket. Pid = 214882
Jul 21 22:14:06 sockfs: [ID 279960 kern.warning] WARNING: bind with uninitialized __sin6_src_id (-1025) on socket. Pid = 214882
ip: [ID 190396 kern.warning] WARNING: connect/send* with uninitialized sin6_scope_id (-2107784) on socket. Pid = 214882
ps -aef |grep 214882
root 953415 951310 0 13:05:41 pts/5 0:00 grep 214882
214882 214852 2 Jul 21 ? 6330:44 sun86/webServer -p 1186 -m
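
To see at a glance which kernel modules are logging these warnings, the lines can be grouped by module and message. The following Python sketch does that for a few lines copied from the output above; the regular expression is an assumption based only on the line format shown here. In this sample the module names are sockfs and ip, and smartpqi does not appear, though that alone does not establish the cause.

```python
import re
from collections import Counter

# Matches lines such as:
#   Jul 21 22:14:06 sockfs: [ID 431381 kern.warning] WARNING: bind with ...
# The pattern is derived from the log lines in this thread only.
LINE_RE = re.compile(
    r"(?P<module>\w+): \[ID \d+ [\w.]+\] WARNING: (?P<text>.*)$"
)


def summarize(lines):
    """Count WARNING lines per (module, message) pair."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            counts[(m.group("module"), m.group("text"))] += 1
    return counts


SAMPLE = """\
Jul 21 22:14:06 sockfs: [ID 431381 kern.warning] WARNING: bind with uninitialized sin6_scope_id (-2107784) on socket. Pid = 214882
Jul 21 22:14:06 sockfs: [ID 279960 kern.warning] WARNING: bind with uninitialized __sin6_src_id (-1025) on socket. Pid = 214882
Jul 21 22:14:06 ip: [ID 190396 kern.warning] WARNING: connect/send* with uninitialized sin6_scope_id (-2107784) on socket. Pid = 214882
Jul 21 22:14:06 sockfs: [ID 431381 kern.warning] WARNING: bind with uninitialized sin6_scope_id (-2107784) on socket. Pid = 214882
""".splitlines()

for (module, text), n in sorted(summarize(SAMPLE).items()):
    print(f"{n:3d}  {module:8s} {text}")
```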
Actions #18

Updated by anil choudhary 7 months ago

The production system has been up for the last 9 days; we can close this bug.

root@~# uptime
14:52:09 up 9 day(s), 19:20, 2 users, load average: 3.50, 3.53, 3.57
root@~# uname -a
SunOS scl-r1-c1-b4-gz 5.11 smartpqi5-0-g9b8bee96a3 i86pc i386 i86pc

Thanks,
Anil

Actions #19

Updated by Toomas Soome 7 months ago

Hi!

Since we have had some updates in the code, would it be possible for you to rebuild with the latest source and re-test, please?

thanks,
toomas

Actions #20

Updated by anil choudhary 7 months ago

Hi Toomas,
I took the code from the following link:

https://github.com/racktopsystems/illumos-gate/commits/smartpqi5

I didn't see any changes since my last build.

Can you share the link to the code I should build from?

Thanks,
Anil

Actions #21

Updated by anil choudhary 7 months ago

As this is a production system, I need to know exactly what changes were made, and I will compile only those.

Actions #22

Updated by Toomas Soome 7 months ago

anil choudhary wrote in #note-20:

Hi Toomas,
I took the code from the following link:

https://github.com/racktopsystems/illumos-gate/commits/smartpqi5

I didn't see any changes since my last build.

Can you share the link to the code I should build from?

Thanks,
Anil

The link is the Gerrit CR number above (2946).

Actions #23

Updated by anil choudhary 7 months ago

Hi Toomas,
Can you give me the steps to fetch this particular commit and build it?

Thanks,
Anil

Actions #24

Updated by anil choudhary 7 months ago

The nightly build of the latest code was successful.

I will do some basic tests and update here.

Thanks,
Anil

Actions #25

Updated by anil choudhary 6 months ago

I have put the build on a test machine
and will watch it for a few days.

Thanks,
Anil

Actions #26

Updated by Toomas Soome 6 months ago

Many thanks!

Toomas

Actions #27

Updated by anil choudhary 6 months ago

It has been up for the last 48 hours.

Thanks,
Anil

Actions #28

Updated by Sebastian Wiedenroth 5 months ago

Some additional testing:

We pulled this change into our version of SmartOS, had this running for a bit over a week, wrote 40TB+ onto a zfs pool where 8 disks are behind such a controller:

pci9005,902 (pciex9005,28f) [Adaptec Smart Storage PQI SAS], instance #0 (driver name: smartpqi)

Everything seems to work great so far. Thanks!

Actions #29

Updated by Dan McDonald 5 months ago

Sebastian Wiedenroth wrote in #note-28:

Some additional testing:

We pulled this change into our version of SmartOS, had this running for a bit over a week, wrote 40TB+ onto a zfs pool where 8 disks are behind such a controller:

These are 8 disks in HBA mode? Or as a HW RAID group? Or as 8 RAID0 single-disks?

Actions #30

Updated by Sebastian Wiedenroth 5 months ago

Dan McDonald wrote in #note-29:

These are 8 disks in HBA mode? Or as a HW RAID group? Or as 8 RAID0 single-disks?

As far as I understand there's no HW RAID configured, just HBA. Looks like this:

# diskinfo 
TYPE    DISK                    VID      PID              SIZE          RMV SSD
SCSI    c0t30000D1E01209280d0   ATA      TOSHIBA MG08ACA1 14902.00 GiB  no  no 
SCSI    c0t30000D1E01209282d0   ATA      TOSHIBA MG08ACA1 14902.00 GiB  no  no 
SCSI    c0t30000D1E01209281d0   ATA      TOSHIBA MG08ACA1 14902.00 GiB  no  no 
SCSI    c0t30000D1E01209283d0   ATA      TOSHIBA MG08ACA1 14902.00 GiB  no  no 
SCSI    c0t30000D1E01209284d0   ATA      ST16000NM001J-2T 14902.00 GiB  no  no 
SCSI    c0t30000D1E01209286d0   ATA      TOSHIBA MG08ACA1 14902.00 GiB  no  no 
SCSI    c0t30000D1E01209285d0   ATA      ST16000NM001J-2T 14902.00 GiB  no  no 
SCSI    c0t30000D1E01209287d0   ATA      TOSHIBA MG08ACA1 14902.00 GiB  no  no 
SCSI    c1t0d0                  Jetflash Transcend 16G      14.72 GiB   yes yes
SATA    c2t0d0                  TOSHIBA  MG08ACA16TEY     14902.00 GiB  no  no 
SATA    c2t1d0                  TOSHIBA  MG08ACA16TEY     14902.00 GiB  no  no 
SATA    c2t2d0                  WDC       WUH721816ALE6L0 14902.00 GiB  no  no 
SATA    c2t3d0                  TOSHIBA  MG08ACA16TEY     14902.00 GiB  no  no 
SATA    c2t4d0                  TOSHIBA  MG08ACA16TEY     14902.00 GiB  no  no 
SATA    c2t5d0                  WDC       WUH721816ALE6L0 14902.00 GiB  no  no 
NVME    c3t1d0                  SAMSUNG  MZQL23T8HCLS-00A07 3576.98 GiB   no  yes
NVME    c4t1d0                  SAMSUNG  MZQL23T8HCLS-00A07 3576.98 GiB   no  yes
# zpool status
  pool: data
 state: ONLINE
  scan: none requested
config:

        NAME                       STATE     READ WRITE CKSUM
        data                       ONLINE       0     0     0
          raidz3-0                 ONLINE       0     0     0
            c0t30000D1E01209280d0  ONLINE       0     0     0
            c0t30000D1E01209281d0  ONLINE       0     0     0
            c0t30000D1E01209282d0  ONLINE       0     0     0
            c0t30000D1E01209283d0  ONLINE       0     0     0
            c0t30000D1E01209284d0  ONLINE       0     0     0
            c0t30000D1E01209285d0  ONLINE       0     0     0
            c0t30000D1E01209286d0  ONLINE       0     0     0
            c0t30000D1E01209287d0  ONLINE       0     0     0
            c2t0d0                 ONLINE       0     0     0
            c2t1d0                 ONLINE       0     0     0
            c2t2d0                 ONLINE       0     0     0
            c2t3d0                 ONLINE       0     0     0
            c2t4d0                 ONLINE       0     0     0
            c2t5d0                 ONLINE       0     0     0

errors: No known data errors

  pool: zones
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        zones       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            c3t1d0  ONLINE       0     0     0
            c4t1d0  ONLINE       0     0     0

errors: No known data errors

The other SATA disks are behind an Intel ahci controller.
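
For longer test runs it can help to check output like the above automatically. Below is a small Python sketch that scans `zpool status` text for any row that is not ONLINE or has non-zero READ/WRITE/CKSUM counters. The parsing is an assumption based only on the layout shown in this comment (a NAME/STATE header followed by rows up to the next blank line); rows with fewer than five columns, such as spares, are ignored.

```python
def vdev_rows(status_text):
    """Yield (name, state, [read, write, cksum]) for each row that follows
    a NAME/STATE header, up to the next blank line."""
    in_table = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:2] == ["NAME", "STATE"]:
            in_table = True
        elif in_table and not fields:
            in_table = False  # a blank line ends the config table
        elif in_table and len(fields) >= 5:
            yield fields[0], fields[1], fields[2:5]


def unhealthy(status_text):
    """Return rows whose state is not ONLINE or whose counters are not all 0."""
    return [
        (name, state, counters)
        for name, state, counters in vdev_rows(status_text)
        if state != "ONLINE" or counters != ["0", "0", "0"]
    ]


# Sample copied from the "zones" pool output in this comment.
SAMPLE = """\
  pool: zones
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        zones       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            c3t1d0  ONLINE       0     0     0
            c4t1d0  ONLINE       0     0     0

errors: No known data errors
"""

print(unhealthy(SAMPLE) or "all listed vdevs ONLINE with zero error counters")
```

Counters are compared as strings so that abbreviated values (if any appear) are still flagged as non-zero rather than causing a parse error.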

Actions #31

Updated by Sebastian Wiedenroth 5 months ago

More details about the device: https://www.microchip.com/en-us/product/hba-1100-8i

                vendor-name:  'Adaptec'
                device-name:  'Smart Storage PQI SAS'
                subsystem-name:  'HBA 1100-8i'

Actions #32

Updated by Toomas Soome 5 months ago

  • Description updated (diff)
  • Category set to driver - device drivers
  • % Done changed from 0 to 90

Actions #33

Updated by Toomas Soome 5 months ago

  • Description updated (diff)

Actions #34

Updated by Joshua M. Clulow 5 months ago

Testing Notes

I have attempted to synthesize testing notes from various sources; e.g.,

I also went through a lot of other mailing list threads, but haven't found other evidence of testing or of which specific models of controller have been tried with the driver, etc.

Hardware

We have reports of testing from various systems:

  • Microsemi Adaptec HBA 1100-8i (add-on card)
  • HPE Smart Array P408i family (unclear which specific controller)
  • HPE "Gen8 Blade", though a Gen8 blade would not have shipped with a storage controller that uses this driver as far as I can tell; either an unknown add-in card was used or it's not actually a Gen8 Blade but perhaps a Gen10?

It is not clear which firmware versions were running in the controllers when they were tested.

It is not clear on which of these systems the final revision of the driver has been tested, rather than an earlier version, though the Adaptec add-on card is likely the most recently tested.

I/O Load

A five day "stress test" with no reported problems at Tintri; unclear which controller, what pool topology, or what quantity of I/O activity.

40TB+ of writes over the course of a week into a ZFS pool that has 8 disks behind the Adaptec HBA 1100-8i add-on card.

It's not apparent whether any ZFS scrubs were done during or after these tests.
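
For a rough sense of scale, the 40TB+ figure can be converted to an average write rate. The Python sketch below assumes decimal terabytes and exactly seven days, neither of which is stated precisely in the reports, so the result is only an approximate lower bound for the whole pool and says nothing about peak load or per-disk rates.

```python
TB = 10**12                      # decimal terabyte (assumption)
WEEK_SECONDS = 7 * 24 * 60 * 60  # "a week" taken as exactly 7 days (assumption)


def average_rate_mb_s(total_bytes, seconds):
    """Average throughput in MB/s (10**6 bytes per second)."""
    return total_bytes / seconds / 10**6


rate = average_rate_mb_s(40 * TB, WEEK_SECONDS)
print(f"{rate:.1f} MB/s")  # prints "66.1 MB/s"
```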

Failure Testing

At least one disk was pulled from a live system running an earlier version of the driver. It's not clear what happened when the disk was pulled with respect to the device nodes or ZFS pools, or if the system was under I/O load at the time, or if there were other disks attached to the same controller and whether they were interrupted or not.

Crash Dumps

It is not presently known if a crash dump can be written to disks behind the SmartPQI driver, and as such there are also no ::findleaks results.

Actions #35

Updated by Electric Monk 5 months ago

  • Status changed from In Progress to Closed
  • % Done changed from 90 to 100

git commit 8226594fdd4479be135127f43632f1f995074654

commit  8226594fdd4479be135127f43632f1f995074654
Author: Rick McNeal <rick.mcneal@nexenta.com>
Date:   2023-10-08T06:33:07.000Z

    11380 Support for Microsemi SmartPQI HBA
    Portions contributed by: Garrett D'Amore <garrett@damore.org>
    Portions contributed by: Andrew Stormont <astormont@racktopsystems.com>
    Portions contributed by: Albert Lee <trisk@racktopsystems.com>
    Portions contributed by: Yuri Pankov <ypankov@tintri.com>
    Portions contributed by: Stuart Maybee <smaybee@tintri.com>
    Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
    Reviewed by: Evan Layton <evan.layton@nexenta.com>
    Reviewed by: Rob Gittins <rob.gittins@nexenta.com>
    Reviewed by: Dan Fields <dan.fields@nexenta.com>
    Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
    Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
    Reviewed by: Randy Fishel <randy.fishel@nexenta.com>
    Reviewed by: Gordon Ross <gordon.w.ross@gmail.com>
    Reviewed by: Guy Morrogh <gmorrogh@tintri.com>
    Approved by: Joshua M. Clulow <josh@sysmgr.org>

Actions #36

Updated by Sebastian Wiedenroth 5 months ago

I tested writing a dump with reboot -d which worked.

The ::findleaks output looks like this:

> ::findleaks
mdb: failed to read slab at 0xfffffe27ad035fb8: no mapping for address
mdb: failed to read slab at 0xfffffe2784955fb8: no mapping for address
mdb: failed to read slab at 0xfffffe27ad035fb8: no mapping for address
CACHE             LEAKED           BUFFER CALLER
fffffe22fde4d008       1 fffffe2374f4fdf8 ?
fffffe22fde57008       5 fffffe2341906380 ?
fffffe22fde59008       6 fffffe23366b5420 ?
fffffe22fde6c008       1 fffffe2378813080 ?
------------------------------------------------------------------------
           Total      13 buffers, 1208 bytes

I'm not sure if those are related to SmartPQI, expected or something else.

The firmware version is reported as NOTICE: smartpqi0 - firmware version: 3.53-0
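
As a sanity check on the ::findleaks table above, the per-cache LEAKED counts can be summed and compared with the reported total. This Python sketch assumes the four-column row layout shown in this comment. It only confirms that the rows add up; with the CALLER column showing "?", it cannot attribute the buffers to any driver.

```python
import re


def leak_totals(findleaks_text):
    """Return (row_count, leaked_buffers) summed over the data rows.

    Data rows are assumed to look like:
        <cache addr> <leaked count> <buffer addr> <caller>
    """
    rows = 0
    leaked = 0
    for line in findleaks_text.splitlines():
        fields = line.split()
        if len(fields) == 4 and fields[1].isdigit():
            rows += 1
            leaked += int(fields[1])
    return rows, leaked


def reported_total(findleaks_text):
    """Return (buffers, bytes) from the 'Total' line, or None if absent."""
    m = re.search(r"Total\s+(\d+) buffers, (\d+) bytes", findleaks_text)
    return (int(m.group(1)), int(m.group(2))) if m else None


# Sample copied from the output in this comment.
SAMPLE = """\
mdb: failed to read slab at 0xfffffe27ad035fb8: no mapping for address
CACHE             LEAKED           BUFFER CALLER
fffffe22fde4d008       1 fffffe2374f4fdf8 ?
fffffe22fde57008       5 fffffe2341906380 ?
fffffe22fde59008       6 fffffe23366b5420 ?
fffffe22fde6c008       1 fffffe2378813080 ?
------------------------------------------------------------------------
           Total      13 buffers, 1208 bytes
"""

rows, leaked = leak_totals(SAMPLE)
print(rows, leaked, reported_total(SAMPLE))
```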

Actions #37

Updated by Dan McDonald 5 months ago

Sebastian Wiedenroth wrote in #note-36:

I tested writing a dump with reboot -d which worked.

So you had a "reboot -d" successfully dump using a smartpqi-attached (set of) disk(s)? Cool!

The ::findleaks output looks like this:

If you run a DEBUG kernel you can get more useful data from ::findleaks.

Since this is in -gate now, SmartOS nightly builds (and other distro builds) will have it available, including DEBUG builds. Contact me directly for SmartOS DEBUG-build details.

Actions #38

Updated by Joshua M. Clulow 4 months ago

  • Related to Bug #15979: kernel heap corruption detected during disk pulls on HPE appliance added

Actions #39

Updated by Electric Monk 4 months ago

git commit fdcec61a6b986cc2f167ee081f8a0a2609f0f1a6

commit  fdcec61a6b986cc2f167ee081f8a0a2609f0f1a6
Author: Yuri Pankov <ypankov@tintri.com>
Date:   2023-10-12T03:41:56.000Z

    11380 Support for Microsemi SmartPQI HBA (missed small fixes)
    Reviewed by: Toomas Soome <tsoome@me.com>
    Approved by: Joshua M. Clulow <josh@sysmgr.org>
