Bug #15302

open

broadcom 3108 kernel driver update

Added by Lee Damon 5 months ago. Updated 2 months ago.

Status: New
Priority: High
Assignee: -
Category: driver - device drivers
Start date:
Due date:
% Done: 0%
Estimated time:
Difficulty: Medium
Tags:
Gerrit CR:
External Bug:

Description

Greetings,

I have a shiny new file server running OmniOS 151038 LTS (also tested against 151044, same problem). The host includes a Broadcom 3108 "Invader" HBA and is having problems booting because the HBA is reporting errors. Digging into it, we're seeing this in the device's internal log:

35992: 22-12-19,18:46:59 WARNING:Host driver needs to be upgraded to enable extended LD support

/var/adm/messages has the following line:
Jan 4 12:17:21 hvfs1 mr_sas: [ID 728489 kern.info] mr_sas0: 0x1000:0x5d 0x15d9:0x809, irq:11 drv-ver:6.503.00.00ILLUMOS-20170524

If I'm reading it right, the current kernel driver version is 6.503.00.00.
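
For anyone checking other hosts, the version can be pulled out of that log line mechanically. This is a minimal Python sketch, assuming the `drv-ver:` field is always formatted as in the line above; the function name `parse_drv_ver` is made up for illustration and is not part of any illumos tool.

```python
import re

# Sample line copied from /var/adm/messages in this report.
LINE = ("Jan 4 12:17:21 hvfs1 mr_sas: [ID 728489 kern.info] mr_sas0: "
        "0x1000:0x5d 0x15d9:0x809, irq:11 drv-ver:6.503.00.00ILLUMOS-20170524")


def parse_drv_ver(line):
    """Return (version, suffix) from an mr_sas attach line, or None.

    Assumption: the version is a dotted run of digits directly after
    'drv-ver:', and anything glued onto it is a vendor/date suffix.
    """
    m = re.search(r"drv-ver:(\d+(?:\.\d+)*)(\S*)", line)
    if m is None:
        return None
    return m.group(1), m.group(2)


print(parse_drv_ver(LINE))
```

The suffix is kept separate because it appears to carry the date of the illumos import rather than a vendor release number.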

Looking at https://www.supermicro.com/wdl/driver/SAS/Broadcom/3108/Driver/ I'm seeing a bunch of drivers, none of which has a version number that low. The highest I see is 07.723.02.00-1, dated 29 Nov 2022.

This file server is unstable so we've had to pull it out of service. I'd really like to get the driver updated so I can put it back into service soon (I had to press my old test server into production to cover this).

thanks,
nomad


Files

mega_errors.txt (157 KB) Lee Damon, 2023-03-20 03:12 PM
PXL_20221219_183121402.jpg (1.81 MB) Lee Damon, 2023-03-20 03:12 PM
Actions #1

Updated by Lee Damon 5 months ago

We've purchased a new card so we can get this critical server back online. I suspect this problem will crop up again in the near future as older cards get harder to find.

Once I get the new hardware and verify it is working I am willing to loan this card to the developer who works on this bug so they have hardware to work against. I would very much appreciate someone updating the driver soon so we don't get bitten by this again.

Actions #2

Updated by Dan McDonald 5 months ago

A quick peek over in FreeBSD's mrsas driver indicates that we need to recognize if a (firmware upgraded?!?) board supports "extended LD support". This is done by its function mrsas_update_ext_vd_details(), which checks for the control info's "max_lds" to be greater than 64 (which is our #defined MAX_LOGICAL_DRIVES in fusion.h).

We'll need to check for that max_lds value, and make adjustments. Alternatively we could just make everything big enough to hold the FreeBSD's MAX_LOGICAL_DRIVES_EXT value of 256 instead.

Nope, the control information in their mrsas driver has evolved quite a bit since the one in our Fusion update. Much more to bring in as part of this.
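
To make the first step concrete, here is a rough sketch of the detection logic described above, written in Python purely for illustration (the real change would be C in the driver). The two constants are the values quoted in this comment; the function name `ld_table_size` is hypothetical, and this covers only the `max_lds` comparison, not the larger control-info changes that would also have to be brought in.

```python
# Values quoted in the comment above.
MAX_LOGICAL_DRIVES = 64        # current #define in illumos fusion.h
MAX_LOGICAL_DRIVES_EXT = 256   # FreeBSD mrsas extended value


def ld_table_size(max_lds):
    """Pick a logical-drive table size from the controller's max_lds.

    Hypothetical helper: a controller reporting more than 64 logical
    drives is treated as supporting extended LD; otherwise the legacy
    size is kept.
    """
    if max_lds > MAX_LOGICAL_DRIVES:
        return MAX_LOGICAL_DRIVES_EXT
    return MAX_LOGICAL_DRIVES


for reported in (64, 65, 240):
    print(reported, "->", ld_table_size(reported))
```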

Actions #3

Updated by Lee Damon 5 months ago

I have the card available. Who should I send it to?

Actions #4

Updated by Lee Damon 3 months ago

This is becoming more critical. I've been informed that SuperMicro only has this unsupported HBA available for the class of file server we use. We're probably going to have to move to another ZFS-capable OS as a result.

Actions #5

Updated by Lee Damon 3 months ago

Symptoms: the host boots and sees all of the drives. Over time, as it runs, it accumulates errors in the HBA's internal log.
Eventually, when the host is rebooted (e.g. for patching), it refuses to boot until those errors are cleared by going into the BIOS and resetting the card.

The rest of this is from the exchange I had with SuperMicro when the problem first manifested:

On boot it reports "AVAGO EFI SAS Driver is Unhealthy" then it refuses to continue booting without interaction. (See photo, attached).

When I go into the BIOS and go to Advanced -> Driver Health I see
"AVAGO EFI SAS Driver - Failed"

Selecting that I see
"AVAGO 3108 MegaRAID Configuration Required"

Selecting that I get a wall of text about "Backup power source needs attention or replacement"

I ack that by hitting 'y' (IIRC there is no backup power for this card, and we don't need or want hardware RAID). I'm then told "Critical Message handling completed. Please exit".

Hitting <ESC> I'm then told
"AVAGO 3108 MegaRAID Reconnect Required"
I hit <ENTER> and say "Ok" to "Proceed with Reconnecting the Controller"

It then says Healthy.

I save configuration and reset to exit the BIOS .... and the system reboots to exactly the same error.

mega_errors.txt is an excerpt from the system log.

From the HBA's log, note line 35992. That's the culprit:

35991: 22-12-19,18:46:05 Info:Time established as 12/19/22 18:46:05; (63 seconds since power on)
35992: 22-12-19,18:46:59 WARNING:Host driver needs to be upgraded to enable extended LD support

35993: 22-12-19,18:47:14 Info:Unexpected sense: PD 29(e0x27/s6) Path 5003048020d09086, CDB: 85 06 2c 00 da 00 00 00 00 00 4f 00 c2 00 b0 00, Sense: 72 1 0 1D 0 0 0 E 9 C 0 0 0 0 0 0 0 4F 0 C2 10 0
35994: 22-12-19,18:47:14 Info:Unexpected sense: PD 37(e0x27/s2) Path 5003048020d09082, CDB: 85 06 2c 00 da 00 00 00 00 00 4f 00 c2 00 b0 00, Sense: 72 1 0 1D 0 0 0 E 9 C 0 0 0 0 0 0 0 4F 0 C2 C 0
35995: 22-12-19,18:47:14 Info:Unexpected sense: PD 38(e0x27/s1) Path 5003048020d09081, CDB: 85 06 2c 00 da 00 00 00 00 00 4f 00 c2 00 b0 00, Sense: 72 1 0 1D 0 0 0 E 9 C 0 0 0 0 0 0 0 4F 0 C2 B 0
35996: 22-12-19,18:47:14 Info:Unexpected sense: PD 39(e0x27/s3) Path 5003048020d09083, CDB: 85 06 2c 00 da 00 00 00 00 00 4f 00 c2 00 b0 00, Sense: 72 1 0 1D 0 0 0 E 9 C 0 0 0 0 0 0 0 4F 0 C2 D 0
35997: 22-12-19,18:47:14 Info:Unexpected sense: PD 3a(e0x27/s4) Path 5003048020d09084, CDB: 85 06 2c 00 da 00 00 00 00 00 4f 00 c2 00 b0 00, Sense: 72 1 0 1D 0 0 0 E 9 C 0 0 0 0 0 0 0 4F 0 C2 E 0
35998: 22-12-19,18:47:14 Info:Unexpected sense: PD 3b(e0x27/s0) Path 5003048020d09080, CDB: 85 06 2c 00 da 00 00 00 00 00 4f 00 c2 00 b0 00, Sense: 72 1 0 1D 0 0 0 E 9 C 0 0 0 0 0 0 0 4F 0 C2 A 0
35999: 22-12-19,18:47:14 Info:Unexpected sense: PD 3c(e0x27/s5) Path 5003048020d09085, CDB: 85 06 2c 00 da 00 00 00 00 00 4f 00 c2 00 b0 00, Sense: 72 1 0 1D 0 0 0 E 9 C 0 0 0 0 0 0 0 4F 0 C2 F 0
36000: 22-12-19,18:47:14 Info:Unexpected sense: PD 29(e0x27/s6) Path 5003048020d09086, CDB: 85 06 2c 00 da 00 00 00 00 00 4f 00 c2 00 b0 00, Sense: 72 1 0 1D 0 0 0 E 9 C 0 0 0 0 0 0 0 4F 0 C2 10 0
36001: 22-12-19,18:47:14 Info:Unexpected sense: PD 37(e0x27/s2) Path 5003048020d09082, CDB: 85 06 2c 00 da 00 00 00 00 00 4f 00 c2 00 b0 00, Sense: 72 1 0 1D 0 0 0 E 9 C 0 0 0 0 0 0 0 4F 0 C2 C 0
36002: 22-12-19,18:47:14 Info:Unexpected sense: PD 38(e0x27/s1) Path 5003048020d09081, CDB: 85 06 2c 00 da 00 00 00 00 00 4f 00 c2 00 b0 00, Sense: 72 1 0 1D 0 0 0 E 9 C 0 0 0 0 0 0 0 4F 0 C2 B 0
36003: 22-12-19,18:47:14 Info:Unexpected sense: PD 39(e0x27/s3) Path 5003048020d09083, CDB: 85 06 2c 00 da 00 00 00 00 00 4f 00 c2 00 b0 00, Sense: 72 1 0 1D 0 0 0 E 9 C 0 0 0 0 0 0 0 4F 0 C2 D 0
36004: 22-12-19,18:47:14 Info:Unexpected sense: PD 3a(e0x27/s4) Path 5003048020d09084, CDB: 85 06 2c 00 da 00 00 00 00 00 4f 00 c2 00 b0 00, Sense: 72 1 0 1D 0 0 0 E 9 C 0 0 0 0 0 0 0 4F 0 C2 E 0
36005: 22-12-19,18:47:14 Info:Event log wrapped
36006: 22-12-19,18:47:14 Info:Unexpected sense: PD 3b(e0x27/s0) Path 5003048020d09080, CDB: 85 06 2c 00 da 00 00 00 00 00 4f 00 c2 00 b0 00, Sense: 72 1 0 1D 0 0 0 E 9 C 0 0 0 0 0 0 0 4F 0 C2 A 0
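
In case it helps with triage, the excerpt is regular enough to summarize with a short script. This is a minimal Python sketch, assuming every entry follows the `seq: date,time Level:message` shape seen above; the sample holds a few lines copied from the excerpt, and the function name `summarize` is made up for illustration.

```python
import re
from collections import Counter

# A few entries copied verbatim from the HBA log excerpt above.
SAMPLE = """\
35992: 22-12-19,18:46:59 WARNING:Host driver needs to be upgraded to enable extended LD support
35993: 22-12-19,18:47:14 Info:Unexpected sense: PD 29(e0x27/s6) Path 5003048020d09086, CDB: 85 06 2c 00 da 00 00 00 00 00 4f 00 c2 00 b0 00, Sense: 72 1 0 1D 0 0 0 E 9 C 0 0 0 0 0 0 0 4F 0 C2 10 0
35994: 22-12-19,18:47:14 Info:Unexpected sense: PD 37(e0x27/s2) Path 5003048020d09082, CDB: 85 06 2c 00 da 00 00 00 00 00 4f 00 c2 00 b0 00, Sense: 72 1 0 1D 0 0 0 E 9 C 0 0 0 0 0 0 0 4F 0 C2 C 0
36000: 22-12-19,18:47:14 Info:Unexpected sense: PD 29(e0x27/s6) Path 5003048020d09086, CDB: 85 06 2c 00 da 00 00 00 00 00 4f 00 c2 00 b0 00, Sense: 72 1 0 1D 0 0 0 E 9 C 0 0 0 0 0 0 0 4F 0 C2 10 0
36005: 22-12-19,18:47:14 Info:Event log wrapped
"""

# sequence number, timestamp (skipped), level, message
LINE_RE = re.compile(r"^(\d+): \S+ (\w+):(.*)$")
PD_RE = re.compile(r"Unexpected sense: PD ([0-9a-f]+)\(")


def summarize(text):
    """Return (warnings, per_pd): WARNING entries and sense counts per PD."""
    warnings = []
    per_pd = Counter()
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m is None:
            continue  # blank or unrecognized line
        seq, level, msg = int(m.group(1)), m.group(2), m.group(3)
        if level == "WARNING":
            warnings.append((seq, msg))
        pd = PD_RE.search(msg)
        if pd:
            per_pd[pd.group(1)] += 1
    return warnings, per_pd


warnings, per_pd = summarize(SAMPLE)
print(warnings)
print(dict(per_pd))
```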

Actions #6

Updated by Jason King 3 months ago

Just to provide a bit more detail here: if I'm decoding things correctly, it's issuing an ATA PASS-THROUGH (16) command, specifically the SMART command with a subcommand of RETURN STATUS.
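
For reference, here is how that decoding can be reproduced from the CDB bytes in the log. This is a small Python sketch that assumes the standard ATA PASS-THROUGH (16) field layout from the SCSI/ATA translation spec and only pulls out the fields relevant here; the function name `decode_ata_pt16` is made up for illustration.

```python
# CDB bytes copied from the "Unexpected sense" log entries.
CDB = bytes.fromhex("85 06 2c 00 da 00 00 00 00 00 4f 00 c2 00 b0 00")


def decode_ata_pt16(cdb):
    """Decode selected fields of an ATA PASS-THROUGH (16) CDB."""
    if len(cdb) != 16 or cdb[0] != 0x85:
        raise ValueError("not an ATA PASS-THROUGH (16) CDB")
    return {
        "protocol": (cdb[1] >> 1) & 0x0F,  # 3 = non-data
        "extend": cdb[1] & 0x01,           # 0 = 28-bit command
        "ck_cond": (cdb[2] >> 5) & 0x01,   # 1 = return ATA registers in sense
        "features": cdb[4],                # 0xDA = SMART RETURN STATUS
        "lba_mid": cdb[10],                # 0x4F, SMART signature
        "lba_high": cdb[12],               # 0xC2, SMART signature
        "command": cdb[14],                # 0xB0 = SMART
    }


for name, value in decode_ata_pt16(CDB).items():
    print(f"{name}: {value:#04x}")
```

The CK_COND bit being set matters here: it asks the translation layer to return the ATA registers in sense data even when the command succeeds.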

The sense code is:
Descriptor format, current data (0x72)
Sense Key is RECOVERED ERROR (0x01)
ASC/ASCQ is ATA PASS-THROUGH INFORMATION AVAILABLE (0x00/0x1D)
The status (last byte) is 0.
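
The same breakdown can be checked against the raw sense bytes. This is a minimal Python sketch, assuming descriptor-format sense data with a single ATA Status Return descriptor (code 0x09), which is what the logged bytes appear to contain; the function name `decode_sense` is made up for illustration.

```python
# Sense bytes as printed in the HBA log (hex, no leading zeros).
SENSE_TEXT = "72 1 0 1D 0 0 0 E 9 C 0 0 0 0 0 0 0 4F 0 C2 10 0"


def decode_sense(text):
    """Decode descriptor-format sense data and its first descriptor."""
    b = [int(tok, 16) for tok in text.split()]
    if len(b) < 8 or (b[0] & 0x7F) != 0x72:
        raise ValueError("not descriptor-format current sense data")
    out = {"sense_key": b[1] & 0x0F, "asc": b[2], "ascq": b[3]}
    # First descriptor starts at byte 8: code, additional length, payload.
    if len(b) >= 10:
        desc = b[8:8 + 2 + b[9]]
        if desc[0] == 0x09 and len(desc) >= 14:  # ATA Status Return
            out.update(
                error=desc[3],
                lba_mid=desc[9],
                lba_high=desc[11],
                device=desc[12],
                status=desc[13],
            )
    return out


print(decode_sense(SENSE_TEXT))
```

In the returned registers, LBA mid/high of 0x4F/0xC2 is the SMART RETURN STATUS value for "threshold not exceeded", so the drive itself is reporting healthy.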

I guess it'd be good to confirm these are SATA drives. It seems like whatever is doing the SCSI<->SATA translation (which in this case isn't the OS) is behaving in a way that the HBA is not expecting. I'm not familiar enough with the existing driver to know if there's anything that could be added to handle this or not (but hopefully the detail is useful).

Actions #7

Updated by Lee Damon 3 months ago

The controller's firmware started out as 4.680.00-8519 and was updated to 4.68.00-8561 while trying to debug the problem.

Actions #8

Updated by Lee Damon 3 months ago

The drives in question are all SAS SSDs. There shouldn't be any SATA hardware past the HBA itself.

Actions #9

Updated by Lee Damon 3 months ago

Expander information, in case that's needed.

Device0 /devices/pci@33,0/pci8086,2032@2/pci15d9,808@0/iport@ff/enclosure@w5003048020d090bd,0:0
Class [ses]
Vendor : SMC
Product : SC846-P
Firmware revision : 100b
Chassis Serial Number : 5003048020d090bf
Target-port identifier : w5003048020d090bd

Device1 /devices/pci@33,0/pci8086,2032@2/pci15d9,808@0/iport@ff/enclosure@w5003048021259c3d,0:0
Class [ses]
Vendor : SMC
Product : SC826N4
Firmware revision : 100d
Chassis Serial Number : 5003048021259c3f
Target-port identifier : w5003048021259c3d

Actions #10

Updated by Lee Damon 2 months ago

The hardware vendor has informed me that 3008 cards are officially off the table after the one we buy this year (presuming we manage to get one - they only have 3 left) so this is now officially an Important Requirement for us. Is anyone working on this problem?
