Bug #5698

panic in mpt_sas: vmem_hash_delete(ffffff1aa3456000, 1, 8): bad free

Added by Alexander Shvayakov about 4 years ago. Updated about 2 years ago.

Status:
Closed
Priority:
High
Assignee:
Category:
driver - device drivers
Start date:
2015-03-09
Due date:
% Done:

100%

Estimated time:
36.00 h
Difficulty:
Medium
Tags:
needs-triage

Description

Hi,

I have an LSI SAS9205-8e with firmware 17.00.01.00:

        Adapter Selected is a LSI SAS: SAS2308_2(B0) 

        Controller Number              : 1
        Controller                     : SAS2308_2(B0) 
        PCI Address                    : 00:07:00:00
        SAS Address                    : 500605b-0-056f-c930
        NVDATA Version (Default)       : 11.00.00.08
        NVDATA Version (Persistent)    : 11.00.00.08
        Firmware Product ID            : 0x2214 (IT)
        Firmware Version               : 17.00.01.00
        NVDATA Vendor                  : LSI
        NVDATA Product ID              : SAS9205-8e
        BIOS Version                   : 07.33.00.00
        UEFI BSD Version               : 07.24.01.00
        FCODE Version                  : N/A
        Board Name                     : SAS9205-8e
        Board Assembly                 : H3-25360-04H
        Board Tracer Number            : SP24610951

I tried to update to version 20.00.02.00 and got a kernel panic.

I repeated the update and got a panic again.

First I erased the flash on my HBA:

./sas2flash -o -e 6 -c 1

and then installed the new firmware:

./sas2flash -o -f 9205-8e.bin -b mptsas2.rom  -c 1

The last message from sas2flash:

http://i.imgur.com/8k165Up.png

fmdump -Vp -u 014fbc31-4173-6e4a-cb49-d7304cadd239
TIME                           UUID                                 SUNW-MSG-ID
Mar 09 2015 14:19:09.419859000 014fbc31-4173-6e4a-cb49-d7304cadd239 SUNOS-8000-KL

  TIME                 CLASS                                 ENA
  Mar 09 14:19:09.4133 ireport.os.sunos.panic.dump_available 0x0000000000000000
  Mar 09 14:17:44.8899 ireport.os.sunos.panic.dump_pending_on_device 0x0000000000000000

nvlist version: 0
        version = 0x0
        class = list.suspect
        uuid = 014fbc31-4173-6e4a-cb49-d7304cadd239
        code = SUNOS-8000-KL
        diag-time = 1425907149 413946
        de = fmd:///module/software-diagnosis
        fault-list-sz = 0x1
        fault-list = (array of embedded nvlists)
        (start fault-list[0])
        nvlist version: 0
                version = 0x0
                class = defect.sunos.kernel.panic
                certainty = 0x64
                asru = sw:///:path=/var/crash/unknown/.014fbc31-4173-6e4a-cb49-d7304cadd239
                resource = sw:///:path=/var/crash/unknown/.014fbc31-4173-6e4a-cb49-d7304cadd239
                savecore-succcess = 1
                dump-dir = /var/crash/unknown
                dump-files = vmdump.0
                os-instance-uuid = 014fbc31-4173-6e4a-cb49-d7304cadd239
                panicstr = vmem_hash_delete(ffffff1aa3456000, 1, 8): bad free
                panicstack = genunix:vmem_hash_delete+9b () | genunix:vmem_xfree+4b () | genunix:vmem_free+23 () | genunix:rmfree+6e () | mpt_sas:mptsas_pkt_destroy_extern+cf () | mpt_sas:mptsas_scsi_destroy_pkt+75 () | scsi:scsi_destroy_pkt+1a () | ses:ses_callback+c1 () | mpt_sas:mptsas_pkt_comp+2b () | mpt_sas:mptsas_doneq_empty+ae () | mpt_sas:mptsas_intr+177 () | apix:apix_dispatch_by_vector+8c () | apix:apix_dispatch_lowlevel+25 () | unix:switch_sp_and_call+13 () | apix:apix_do_interrupt+387 () | unix:cmnint+ba () | unix:i86_mwait+d () | unix:cpu_idle_mwait+109 () | unix:idle+a7 () | unix:thread_start+8 () | 
                crashtime = 1425904511
                panic-time = Mon Mar  9 13:35:11 2015 CET
        (end fault-list[0])

        fault-status = 0x1
        severity = Major
        __ttl = 0x1
        __tod = 0x54fd9dcd 0x19068a38

cat /etc/release 
  OmniOS v11 r151012

uname -v
omnios-10b9c79

MB: Supermicro X9DRE-TF+/X9DR7-TF+

The file crash.0 is attached

Kind regards,
Alexander


Files

crash.0 (578 KB) Alexander Shvayakov, 2015-03-09 03:18 PM
mptreset.c (1.67 KB) Origin: https://github.com/kojack/mptsastools/blob/master/mptreset.c Marcel Telka, 2017-01-19 04:23 PM

Related issues

Related to illumos gate - Feature #5016: improve mpt_sas auto request sense (Closed, 2014-07-17)
Related to illumos gate - Bug #7813: mpt_sas does not like concurrent HBA resets (New, 2017-01-27)

History

#1

Updated by Simon Klinkert almost 4 years ago

Same panic stack trace here, but with a faulty disk and a running zpool resilver after reboot:

Jul 22 19:01:07 yagg genunix: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,d138@3/pci1000,3010@0 (mpt_sas0):
Jul 22 19:01:07 yagg        Target 30 reset for command timeout recovery failed!
Jul 22 19:02:32 yagg genunix: [ID 107833 kern.notice] /pci@0,0/pci8086,d138@3/pci1000,3010@0 (mpt_sas0):
Jul 22 19:02:32 yagg        Timeout of 60 seconds expired with 1 commands on target 34 lun 0.
Jul 22 19:02:32 yagg genunix: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,d138@3/pci1000,3010@0 (mpt_sas0):
Jul 22 19:02:32 yagg        Disconnected command timeout for target 34 w5000c5002d7aebeb, enclosure 2
Jul 22 19:02:33 yagg genunix: [ID 243001 kern.warning] WARNING: /pci@0,0/pci8086,d138@3/pci1000,3010@0 (mpt_sas0):
Jul 22 19:02:33 yagg        mptsas_handle_event_sync: event 0xf, IOCStatus=0x8000, IOCLogInfo=0x31170000
Jul 22 19:02:36 yagg genunix: [ID 365881 kern.info] /pci@0,0/pci8086,d138@3/pci1000,3010@0 (mpt_sas0):
Jul 22 19:02:36 yagg        Log info 0x31140000 received for target 34 w5000c5002d7aebeb.
Jul 22 19:02:36 yagg        scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
Jul 22 19:02:36 yagg genunix: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,d138@3/pci1000,3010@0 (mpt_sas0):
Jul 22 19:02:36 yagg        mptsas_check_task_mgt: Task 0x3 failed. IOCStatus=0x4a IOCLogInfo=0x0 target=34
Jul 22 19:02:36 yagg genunix: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,d138@3/pci1000,3010@0 (mpt_sas0):
Jul 22 19:02:36 yagg        mptsas_ioc_task_management failed try to reset ioc to recovery!
Jul 22 19:02:38 yagg genunix: [ID 365881 kern.info] /pci@0,0/pci8086,d138@3/pci1000,3010@0 (mpt_sas0):
Jul 22 19:02:38 yagg        MPT Firmware version v13.0.1.0 (SAS2004)
Jul 22 19:02:38 yagg genunix: [ID 365881 kern.info] /pci@0,0/pci8086,d138@3/pci1000,3010@0 (mpt_sas0):
Jul 22 19:02:38 yagg        mpt_sas0 MPI Version 0x200
Jul 22 19:02:38 yagg genunix: [ID 365881 kern.info] /pci@0,0/pci8086,d138@3/pci1000,3010@0 (mpt_sas0):
Jul 22 19:02:38 yagg        mpt0: IOC Operational.
Jul 22 19:04:44 yagg genunix: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,d138@3/pci1000,3010@0 (mpt_sas0):
Jul 22 19:04:44 yagg        Target 34 reset for command timeout recovery failed!
Jul 22 19:04:47 yagg genunix: [ID 243001 kern.warning] WARNING: /pci@0,0/pci8086,d138@3/pci1000,3010@0 (mpt_sas0):
Jul 22 19:04:47 yagg        mptsas_handle_event_sync: event 0xf, IOCStatus=0x8000, IOCLogInfo=0x31170000
Jul 22 19:04:47 yagg genunix: [ID 243001 kern.warning] WARNING: /pci@0,0/pci8086,d138@3/pci1000,3010@0 (mpt_sas0):
Jul 22 19:04:47 yagg        mptsas_handle_event: IOCStatus=0x8000, IOCLogInfo=0x31170000
Jul 22 19:04:47 yagg genunix: [ID 365881 kern.info] /pci@0,0/pci8086,d138@3/pci1000,3010@0 (mpt_sas0):
Jul 22 19:04:47 yagg        Log info 0x31170000 received for target 34 p0.
Jul 22 19:04:47 yagg        scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc
Jul 22 19:05:00 yagg fmd: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-FD, TYPE: Fault, VER: 1, SEVERITY: Major
Jul 22 19:05:00 yagg EVENT-TIME: Wed Jul 22 19:05:00 MEST 2015
Jul 22 19:05:00 yagg PLATFORM: X8SIL, CSN: 0123456789, HOSTNAME: yagg
Jul 22 19:05:00 yagg SOURCE: zfs-diagnosis, REV: 1.0
Jul 22 19:05:00 yagg EVENT-ID: 2de13ccc-234e-4a1d-cefd-c3f0c52abddd
Jul 22 19:05:00 yagg DESC: The number of I/O errors associated with a ZFS device exceeded
Jul 22 19:05:00 yagg             acceptable levels.  Refer to http://illumos.org/msg/ZFS-8000-FD for more information.
Jul 22 19:05:00 yagg AUTO-RESPONSE: The device has been offlined and marked as faulted.  An attempt
Jul 22 19:05:00 yagg             will be made to activate a hot spare if available.
Jul 22 19:05:00 yagg IMPACT: Fault tolerance of the pool may be compromised.
Jul 22 19:05:00 yagg REC-ACTION: Run 'zpool status -x' and replace the bad device.
Jul 22 19:16:27 yagg genunix: [ID 107833 kern.notice] /pci@0,0/pci8086,d138@3/pci1000,3010@0 (mpt_sas0):
Jul 22 19:16:27 yagg        Timeout of 60 seconds expired with 1 commands on target 34 lun 0.
Jul 22 19:16:27 yagg genunix: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,d138@3/pci1000,3010@0 (mpt_sas0):
Jul 22 19:16:27 yagg        Disconnected command timeout for target 34 w5000c5002d7aebeb, enclosure 2
Jul 22 19:16:27 yagg genunix: [ID 243001 kern.warning] WARNING: /pci@0,0/pci8086,d138@3/pci1000,3010@0 (mpt_sas0):
Jul 22 19:16:27 yagg        mptsas_handle_event_sync: event 0xf, IOCStatus=0x8000, IOCLogInfo=0x31170000
Jul 22 19:16:31 yagg genunix: [ID 365881 kern.info] /pci@0,0/pci8086,d138@3/pci1000,3010@0 (mpt_sas0):
Jul 22 19:16:31 yagg        Log info 0x31140000 received for target 34 w5000c5002d7aebeb.
Jul 22 19:16:31 yagg        scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
Jul 22 19:16:31 yagg genunix: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,d138@3/pci1000,3010@0 (mpt_sas0):
Jul 22 19:16:31 yagg        mptsas_check_task_mgt: Task 0x3 failed. IOCStatus=0x4a IOCLogInfo=0x0 target=34
Jul 22 19:16:31 yagg genunix: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,d138@3/pci1000,3010@0 (mpt_sas0):
Jul 22 19:16:31 yagg        mptsas_ioc_task_management failed try to reset ioc to recovery!
Jul 22 19:16:32 yagg genunix: [ID 365881 kern.info] /pci@0,0/pci8086,d138@3/pci1000,3010@0 (mpt_sas0):
Jul 22 19:16:32 yagg        MPT Firmware version v13.0.1.0 (SAS2004)
Jul 22 19:16:32 yagg genunix: [ID 365881 kern.info] /pci@0,0/pci8086,d138@3/pci1000,3010@0 (mpt_sas0):
Jul 22 19:16:32 yagg        mpt_sas0 MPI Version 0x200
Jul 22 19:16:32 yagg genunix: [ID 723447 kern.warning] WARNING: vmem_destroy('rmap'): leaked 8 bytes
Jul 22 19:16:32 yagg genunix: [ID 365881 kern.info] /pci@0,0/pci8086,d138@3/pci1000,3010@0 (mpt_sas0):
Jul 22 19:16:32 yagg        mpt0: IOC Operational.
Jul 22 19:18:39 yagg genunix: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,d138@3/pci1000,3010@0 (mpt_sas0):
Jul 22 19:18:39 yagg unix: [ID 836849 kern.notice]
Jul 22 19:18:39 yagg ^Mpanic[cpu3]/thread=ffffff001ec4bc40:
Jul 22 19:18:39 yagg genunix: [ID 697804 kern.notice] vmem_hash_delete(ffffff04ec874000, 1, 8): bad free
Jul 22 19:18:39 yagg unix: [ID 100000 kern.notice]
Jul 22 19:18:39 yagg genunix: [ID 655072 kern.notice] ffffff001ec4b900 genunix:vmem_hash_delete+9b ()
Jul 22 19:18:39 yagg genunix: [ID 655072 kern.notice] ffffff001ec4b960 genunix:vmem_xfree+4b ()
Jul 22 19:18:39 yagg genunix: [ID 655072 kern.notice] ffffff001ec4b990 genunix:vmem_free+23 ()
Jul 22 19:18:39 yagg genunix: [ID 655072 kern.notice] ffffff001ec4b9e0 genunix:rmfree+6e ()
Jul 22 19:18:39 yagg genunix: [ID 655072 kern.notice] ffffff001ec4ba10 mpt_sas:mptsas_pkt_destroy_extern+cf ()
Jul 22 19:18:39 yagg genunix: [ID 655072 kern.notice] ffffff001ec4ba40 mpt_sas:mptsas_scsi_destroy_pkt+75 ()
Jul 22 19:18:39 yagg genunix: [ID 655072 kern.notice] ffffff001ec4ba60 scsi:scsi_destroy_pkt+1a ()
Jul 22 19:18:39 yagg genunix: [ID 655072 kern.notice] ffffff001ec4bab0 ses:ses_callback+c1 ()
Jul 22 19:18:39 yagg genunix: [ID 655072 kern.notice] ffffff001ec4bae0 mpt_sas:mptsas_pkt_comp+2b ()
Jul 22 19:18:39 yagg genunix: [ID 655072 kern.notice] ffffff001ec4bb30 mpt_sas:mptsas_doneq_empty+ae ()
Jul 22 19:18:40 yagg genunix: [ID 655072 kern.notice] ffffff001ec4bb70 mpt_sas:mptsas_intr+177 ()
Jul 22 19:18:40 yagg genunix: [ID 655072 kern.notice] ffffff001ec4bbe0 unix:av_dispatch_autovect+91 ()
Jul 22 19:18:40 yagg genunix: [ID 655072 kern.notice] ffffff001ec4bc20 unix:dispatch_hardint+36 ()
Jul 22 19:18:40 yagg genunix: [ID 655072 kern.notice] ffffff001ec15a60 unix:switch_sp_and_call+13 ()
Jul 22 19:18:40 yagg genunix: [ID 655072 kern.notice] ffffff001ec15ac0 unix:do_interrupt+a8 ()
Jul 22 19:18:40 yagg genunix: [ID 655072 kern.notice] ffffff001ec15ad0 unix:cmnint+ba ()
Jul 22 19:18:40 yagg genunix: [ID 655072 kern.notice] ffffff001ec15bc0 unix:i86_mwait+d ()
Jul 22 19:18:40 yagg genunix: [ID 655072 kern.notice] ffffff001ec15c00 unix:cpu_idle_mwait+109 ()
Jul 22 19:18:40 yagg genunix: [ID 655072 kern.notice] ffffff001ec15c20 unix:idle+a7 ()
Jul 22 19:18:40 yagg genunix: [ID 655072 kern.notice] ffffff001ec15c30 unix:thread_start+8 ()
#2

Updated by Marcel Telka over 2 years ago

Another one (note the "leaked 8 bytes" warning just a few seconds before the panic):

> ::status
debugging crash dump vmcore.0.glatra.20161105-1151 (64-bit) from glatra
operating system: 5.11 titanic_30 (i86pc)
image uuid: bc701646-ea61-c47b-f73c-83a3d72e1176
panic message: vmem_hash_delete(ffffff066d020000, 1, 8): bad free
dump content: kernel pages only
> ::stack
vpanic()
vmem_hash_delete+0x9b(ffffff066d020000, 1, 8)
vmem_xfree+0x4b(ffffff066d020000, 1, 8)
vmem_free+0x23(ffffff066d020000, 1, 8)
rmfree+0x6e(ffffff066d020000, 8, 1)
mptsas_pkt_destroy_extern+0xcf(ffffff066bbdd000, ffffffecdf782da8)
mptsas_scsi_destroy_pkt+0x75(ffffffecdf782f38, ffffffecdf782f30)
scsi_destroy_pkt+0x1a(ffffffecdf782f30)
sd_destroypkt_for_uscsi+0xce(ffffff0a57811e80)
sd_return_command+0x10e(ffffff066c1d7000, ffffff0a57811e80)
sd_sense_key_recoverable_error+0x104(ffffff066c1d7000, ffffff44d49654e9, ffffff0a57811e80, ffffff44d4965480, ffffffecdf782f30)
sd_decode_sense+0xcc(ffffff066c1d7000, ffffff0a57811e80, ffffff44d4965480, ffffffecdf782f30)
sd_handle_auto_request_sense+0xcd(ffffff066c1d7000, ffffff0a57811e80, ffffff44d4965480, ffffffecdf782f30)
sdintr+0x391(ffffffecdf782f30)
mptsas_pkt_comp+0x2b(ffffffecdf782f30, ffffffecdf782da8)
mptsas_doneq_empty+0xae(ffffff066bbdd000)
mptsas_intr+0x177(ffffff066bbdd000, 0)
av_dispatch_autovect+0x91(1a)
dispatch_hardint+0x36(1a, 0)
switch_sp_and_call+0x13()
do_interrupt+0xa8(ffffff002da5cf10, 816adb0)
_interrupt+0xba()
> ::msgbuf -v ! tail -n 58|head -n 40
2016 Nov  5 11:20:02 ffffff48f6e38bf0 /pci@0,0/pci8086,d138@3/pci1000,3010@0 (mpt_sas0):
        Log info 0x31140000 received for target 19 w5000cca234dab8a5.
        scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
2016 Nov  5 11:20:02 ffffff0a52ded2f0 /pci@0,0/pci8086,d138@3/pci1000,3010@0 (mpt_sas0):
        Log info 0x31140000 received for target 19 w5000cca234dab8a5.
        scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
2016 Nov  5 11:20:02 ffffff495d294cf0 /pci@0,0/pci8086,d138@3/pci1000,3010@0 (mpt_sas0):
        Log info 0x31140000 received for target 19 w5000cca234dab8a5.
        scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
2016 Nov  5 11:21:31 ffffff44cc002130 WARNING: /pci@0,0/pci8086,d138@3/pci1000,3010@0 (mpt_sas0):
        passthrough command timeout
2016 Nov  5 11:21:33 ffffff06778ed0b0 /pci@0,0/pci8086,d138@3/pci1000,3010@0 (mpt_sas0):
        MPT Firmware version v19.0.0.0 (SAS2004)
2016 Nov  5 11:21:33 ffffff6f09a78d70 /pci@0,0/pci8086,d138@3/pci1000,3010@0 (mpt_sas0):
        mpt_sas0 MPI Version 0x200
2016 Nov  5 11:21:33 ffffff497b4aee30 /pci@0,0/pci8086,d138@3/pci1000,3010@0 (mpt_sas0):
        mpt0: IOC Operational.
2016 Nov  5 11:23:20 ffffffed0dacaef0 WARNING: /pci@0,0/pci8086,d138@3/pci1000,3010@0 (mpt_sas0):
        passthrough command timeout
2016 Nov  5 11:23:21 ffffff4948099170 /pci@0,0/pci8086,d138@3/pci1000,3010@0 (mpt_sas0):
        MPT Firmware version v19.0.0.0 (SAS2004)
2016 Nov  5 11:23:21 ffffff351a16ee70 /pci@0,0/pci8086,d138@3/pci1000,3010@0 (mpt_sas0):
        mpt_sas0 MPI Version 0x200
2016 Nov  5 11:23:21 ffffff0a36666770 WARNING: vmem_destroy('rmap'): leaked 8 bytes
2016 Nov  5 11:23:21 ffffff0ae50e6ab0 /pci@0,0/pci8086,d138@3/pci1000,3010@0 (mpt_sas0):
        mpt0: IOC Operational.
2016 Nov  5 11:23:35 ffffff0af36ae930 
panic[cpu1]/thread=ffffff001eb88c40: 
2016 Nov  5 11:23:35 ffffff066a517770 vmem_hash_delete(ffffff066d020000, 1, 8): bad free
2016 Nov  5 11:23:35 ffffff5368328930 

2016 Nov  5 11:23:35 ffffff06df6444b0 ffffff001eb886f0 genunix:vmem_hash_delete+9b ()
2016 Nov  5 11:23:36 ffffff0676d39ef0 ffffff001eb88750 genunix:vmem_xfree+4b ()
2016 Nov  5 11:23:36 ffffffb5b1a0dc70 ffffff001eb88780 genunix:vmem_free+23 ()
2016 Nov  5 11:23:36 ffffff06d1c854b0 ffffff001eb887d0 genunix:rmfree+6e ()
2016 Nov  5 11:23:36 ffffffecdc008870 ffffff001eb88800 mpt_sas:mptsas_pkt_destroy_extern+cf ()
2016 Nov  5 11:23:36 ffffffe56367e9f0 ffffff001eb88830 mpt_sas:mptsas_scsi_destroy_pkt+75 ()
2016 Nov  5 11:23:36 ffffffecdf193330 ffffff001eb88850 scsi:scsi_destroy_pkt+1a ()
2016 Nov  5 11:23:36 ffffff0a438ac2f0 ffffff001eb88890 sd:sd_destroypkt_for_uscsi+ce ()
2016 Nov  5 11:23:36 ffffffde1036aeb0 ffffff001eb888f0 sd:sd_return_command+10e ()
>
#3

Updated by Marcel Telka over 2 years ago

  • Subject changed from My System Crashed during the LSI SAS9205-8e firmware update to panic in mpt_sas: vmem_hash_delete(ffffff1aa3456000, 1, 8): bad free
  • Category set to driver - device drivers
#4

Updated by Marcel Telka over 2 years ago

  • Related to Feature #5016: improve mpt_sas auto request sense added
#5

Updated by Marcel Telka over 2 years ago

  • Priority changed from Normal to High

Happened again today:

> ::status
debugging crash dump vmcore.0 (64-bit) from efrem
operating system: 5.11 titanic_29 (i86pc)
image uuid: aa68d3bc-cd31-429d-efa4-8c0bb262f3a7
panic message: vmem_hash_delete(ffffff0676299000, 1, 8): bad free
dump content: kernel pages only
> ::stack
vpanic()
vmem_hash_delete+0x9b(ffffff0676299000, 1, 8)
vmem_xfree+0x4b(ffffff0676299000, 1, 8)
vmem_free+0x23(ffffff0676299000, 1, 8)
rmfree+0x6e(ffffff0676299000, 8, 1)
mptsas_pkt_destroy_extern+0xcf(ffffff0674ac9000, ffffff08be6ec8a8)
mptsas_scsi_destroy_pkt+0x75(ffffff08be6eca38, ffffff08be6eca30)
scsi_destroy_pkt+0x1a(ffffff08be6eca30)
ses_callback+0xc1(ffffff08be6eca30)
mptsas_pkt_comp+0x2b(ffffff08be6eca30, ffffff08be6ec8a8)
mptsas_doneq_thread+0x7f(ffffff0675c8d830)
thread_start+8()
>
#6

Updated by Igor Kozhukhov over 2 years ago

What is the base system?
How old is mpt_sas?
It would be better to know the latest illumos revision for this crash.

#7

Updated by Marcel Telka over 2 years ago

Igor Kozhukhov wrote:

What is the base system?
How old is mpt_sas?
It would be better to know the latest illumos revision for this crash.

This is happening on systems whose mpt_sas driver sources are identical to those in the latest illumos, with one exception: the 6299 fix is not backported there.

#8

Updated by Igor Kozhukhov over 2 years ago

Marcel Telka wrote:

Igor Kozhukhov wrote:

What is the base system?
How old is mpt_sas?
It would be better to know the latest illumos revision for this crash.

This is happening on systems whose mpt_sas driver sources are identical to those in the latest illumos, with one exception: the 6299 fix is not backported there.

I would be interested in test results based on the latest SmartOS; they have an additional patch that is not yet upstreamed to illumos.

#9

Updated by Dan Fields over 2 years ago

I can take this one - looks like it was introduced by:

commit 60bf7f6fe2a9d65f5b54de946358dfd8ff1e46e0
Author: Andy Giles
Date: Wed Jun 18 16:56:29 2014 +0200

5016 improve mpt_sas auto request sense
Reviewed by: Hans Rosenfeld
Reviewed by: Garrett D'Amore
Reviewed by: Dan McDonald
Approved by: Gordon Ross
#10

Updated by Dan Fields over 2 years ago

  • Assignee set to Dan Fields
  • % Done changed from 0 to 40
  • Estimated time set to 36.00 h

The m_req_sense buffer is allocated with kmem_alloc() and set up for DMA.
The extra sense buffer portion of the DMA memory, marked by m_extreq_sense,
is managed by rmalloc() in quanta of 0x20-byte blocks. The idea
is that rmalloc() will find and allocate contiguous 0x20-byte blocks
to accommodate the extra sense size. (The maximum extra sense is 252 bytes,
which rounds up to 8 contiguous blocks.)

 * +-----------------------------------+ m_req_sense
 * |                0x20               |
 * +-----------------------------------+
 * |                 .                 |
 * |                 .                 |
 * |       m_max_requests x 0x20       |
 * |                 .                 |
 * |                 .                 |
 * +-----------------------------------+m_extreq_sense
 * |                0x20               |\
 * +-----------------------------------+ \
 * |                0x20               |  |
 * +-----------------------------------+  |
 * |                0x20               |  |
 * +-----------------------------------+  |-- managed by rmalloc()
 * |                                   |  |
 * |                                   |  |
 * |                                   | /
 * +-----------------------------------+/

I'm not sure how this ever worked properly. The preallocated memory at 'm_extreq_sense' was never set up
in the rmalloc() system; instead the address '1' was given. Then, when the call to rmalloc() was made,
it was incremented by one, which I think was for the purpose of not using the 'zero' position, but that might have been
a misapplication of how RMID=0 is reserved for system messages. That doesn't apply to the extra sense buffers, which are a
pool to be allocated as needed.
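As a quick check of the arithmetic and layout described above, here is a minimal userland C sketch. The helper names sense_blocks() and ext_area_offset() are hypothetical and are not taken from the driver; only the 0x20 quantum, the 252-byte maximum and the "m_max_requests x 0x20" layout come from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Block size (quantum) of the sense buffer area, per the diagram above. */
#define	SENSE_QUANTUM	0x20u

/*
 * Hypothetical helper: number of contiguous 0x20-byte blocks needed to
 * hold `len` bytes of sense data (rounds up).
 */
static size_t
sense_blocks(size_t len)
{
	return ((len + SENSE_QUANTUM - 1) / SENSE_QUANTUM);
}

/*
 * Hypothetical helper: byte offset of the extra sense area from the start
 * of the buffer, assuming the base area holds one 0x20-byte slot per
 * request as drawn in the diagram.
 */
static size_t
ext_area_offset(size_t max_requests)
{
	assert(max_requests <= SIZE_MAX / SENSE_QUANTUM);
	return (max_requests * SENSE_QUANTUM);
}
```

With these definitions, sense_blocks(252) evaluates to 8, which matches the size argument seen in the panic string.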

#11

Updated by Dan Fields over 2 years ago

  • Assignee changed from Dan Fields to Marcel Telka
  • % Done changed from 40 to 70

Marcel has a fix.

#12

Updated by Marcel Telka over 2 years ago

Steps to reproduce

Run the following two scripts concurrently:

# while true ; do sg_raw -r 1k /dev/rdsk/DISK 12 00 00 00 60 00 ; done &
# while true ; do ./mptreset DEVCTL ; sleep 1 ; done &

Replace DISK with the real disk and DEVCTL with the path suggested by mptreset when it is run without parameters.

With the above running, the system should panic (almost) immediately.

#13

Updated by Marcel Telka over 2 years ago

  • Status changed from New to In Progress
#14

Updated by Marcel Telka over 2 years ago

Root cause

The panic happens when the mptsas_restart_ioc() -> mptsas_init_chip() -> mptsas_alloc_sense_bufs() codepath is executed while some outstanding command is still waiting for a reply. In mptsas_alloc_sense_bufs() we rmfreemap/rmallocmap the resource map, including the space used by the pending command (this is where the "leaked 8 bytes" warning comes from). Once the pending command receives its reply, the space used by the command is freed and we hit the panic with the "bad free" string.

How can such a case ever happen?

  2  -> mptsas_scsi_init_pkt
  2    -> mptsas_pkt_alloc_extern
  2      -> mptsas_cmdarqsize
  2        -> rmalloc
              mpt_sas`mptsas_cmdarqsize+0x88
              mpt_sas`mptsas_pkt_alloc_extern+0xa3
              mpt_sas`mptsas_scsi_init_pkt+0x5a5
              scsi`scsi_init_pkt+0x63
              sd`sd_initpkt_for_uscsi+0x8f
              sd`sd_start_cmds+0xcc
              sd`sd_core_iostart+0x90
              sd`sd_uscsi_strategy+0xfa
              genunix`default_physio+0x2db
              genunix`physio+0x25
              scsi`scsi_uscsi_handle_cmd+0x29d
              sd`sd_ssc_send+0x136
              sd`sdioctl+0xa96
              genunix`cdev_ioctl+0x39
              specfs`spec_ioctl+0x60
              genunix`fop_ioctl+0x55
              genunix`ioctl+0x9b
              unix`_sys_sysenter_post_swapgs+0x149

  2        <- rmalloc
  2      <- mptsas_cmdarqsize
  2    <- mptsas_pkt_alloc_extern
  2  <- mptsas_scsi_init_pkt

The scsi_pkt/mptsas_cmd is allocated without any protection. In mptsas_cmdarqsize() both the DMA memory and space from the resource map are allocated/assigned to the mptsas_cmd. Because the allocation is not protected, it is possible (and very likely what is happening) that the reset comes after the mptsas_cmd allocation, but before the command is executed (via mptsas_scsi_start()).

During the reset both the DMA memory and the resource map are freed and then re-allocated (in mptsas_alloc_sense_bufs()), but the mptsas_cmd allocated above still references the now freed DMA memory and the now freed space from the resource map. Once the reply to the command arrives, it is stored in the freed DMA memory area. Later, when the command (mptsas_cmd) is freed, the space from the resource map is freed too, and this causes the panic, because the resource map was already freed some time ago.

This is very likely causing silent memory corruption too (in addition to the panic we see). Currently, this is probably not very serious, because if the corruption happens we will panic anyway.
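The sequence described above can be modelled in userland with a toy resource map. This is only an illustrative sketch: the toy_* names and the stale_free_after_reset() scenario are hypothetical, the map is a plain array rather than the kernel's rmalloc()/vmem implementation, and a failed free is reported as a return value instead of a panic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define	TOY_SLOTS	64	/* toy map size, chosen arbitrarily */

/* Toy resource map: slot indices are 1-based, index 0 means "no space". */
struct toy_map {
	bool used[TOY_SLOTS + 1];
};

static void
toy_map_init(struct toy_map *m)
{
	assert(m != NULL);
	memset(m, 0, sizeof (*m));
}

/* First-fit allocation of n contiguous slots; returns the base or 0. */
static size_t
toy_alloc(struct toy_map *m, size_t n)
{
	if (n == 0 || n > TOY_SLOTS)
		return (0);
	for (size_t base = 1; base + n - 1 <= TOY_SLOTS; base++) {
		size_t i = 0;
		while (i < n && !m->used[base + i])
			i++;
		if (i == n) {
			for (i = 0; i < n; i++)
				m->used[base + i] = true;
			return (base);
		}
	}
	return (0);
}

/* Returns false for a "bad free": range not currently allocated. */
static bool
toy_free(struct toy_map *m, size_t base, size_t n)
{
	if (base == 0 || n == 0 || n > TOY_SLOTS ||
	    base + n - 1 > TOY_SLOTS)
		return (false);
	for (size_t i = 0; i < n; i++) {
		if (!m->used[base + i])
			return (false);
	}
	for (size_t i = 0; i < n; i++)
		m->used[base + i] = false;
	return (true);
}

/* Rebuild the map as a reset would; returns the number of leaked slots. */
static size_t
toy_reset(struct toy_map *m)
{
	size_t leaked = 0;
	for (size_t i = 1; i <= TOY_SLOTS; i++)
		leaked += m->used[i] ? 1 : 0;
	toy_map_init(m);
	return (leaked);
}

/* Packet allocated, reset happens, then the stale range is freed. */
static bool
stale_free_after_reset(size_t *leaked)
{
	struct toy_map m;

	toy_map_init(&m);
	size_t base = toy_alloc(&m, 8);	/* packet init takes 8 slots */
	*leaked = toy_reset(&m);	/* reset rebuilds the map */
	return (toy_free(&m, base, 8));	/* completion frees stale range */
}
```

In this model the reset reports 8 leaked slots and the later free of the range starting at 1 fails, which mirrors the "leaked 8 bytes" warning followed by the "bad free" of (1, 8) in the reports above.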

Fix description

The fix counts the extreq_sense allocations and makes sure the count goes back to zero during the HBA reset. This ensures that, when mptsas_alloc_sense_bufs() is called, no DMA buffers for sense data are assigned to any allocated packet and no space is allocated from the resource map.
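The counting idea can be sketched in userland C as follows. This is not the committed change: the struct and function names are hypothetical, and the sketch simply refuses to rebuild the map while allocations are outstanding, whereas how the driver actually gets the count back to zero during the reset is not described here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-HBA state for the sketch. */
struct toy_hba {
	size_t ext_sense_refcount;	/* outstanding extra sense allocations */
	unsigned map_generation;	/* bumped on every map rebuild */
};

/* A packet takes space from the extra sense area. */
static void
toy_pkt_alloc(struct toy_hba *h)
{
	h->ext_sense_refcount++;
}

/* A packet gives its space back. */
static void
toy_pkt_free(struct toy_hba *h)
{
	assert(h->ext_sense_refcount > 0);
	h->ext_sense_refcount--;
}

/*
 * Rebuild the sense buffers and the resource map only when nothing
 * references them. Returns false if the rebuild had to be skipped.
 */
static bool
toy_rebuild_sense_map(struct toy_hba *h)
{
	if (h->ext_sense_refcount != 0)
		return (false);
	h->map_generation++;
	return (true);
}
```

With the counter in place, a rebuild attempted between toy_pkt_alloc() and toy_pkt_free() is rejected, so no packet is left holding space from a map that no longer exists.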

#15

Updated by Marcel Telka about 2 years ago

  • Related to Bug #7813: mpt_sas does not like concurrent HBA resets added
#17

Updated by Marcel Telka about 2 years ago

  • Status changed from In Progress to Pending RTI
#18

Updated by Electric Monk about 2 years ago

  • Status changed from Pending RTI to Closed
  • % Done changed from 70 to 100

git commit 2482ae1b96a558eec551575934d5f06c87b807af

commit  2482ae1b96a558eec551575934d5f06c87b807af
Author: Marcel Telka <marcel@telka.sk>
Date:   2017-02-02T12:31:12.000Z

    5698 panic in mpt_sas: vmem_hash_delete(ffffff1aa3456000, 1, 8): bad free
    Reviewed by: Dan Fields <danfields@fastmail.us>
    Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
    Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
