Bug #5698
Closed
panic in mpt_sas: vmem_hash_delete(ffffff1aa3456000, 1, 8): bad free
100%
Description
Hi,
I had an LSI SAS9205-8e running firmware 17.00.01.00:
Adapter Selected is a LSI SAS: SAS2308_2(B0)
Controller Number : 1
Controller : SAS2308_2(B0)
PCI Address : 00:07:00:00
SAS Address : 500605b-0-056f-c930
NVDATA Version (Default) : 11.00.00.08
NVDATA Version (Persistent) : 11.00.00.08
Firmware Product ID : 0x2214 (IT)
Firmware Version : 17.00.01.00
NVDATA Vendor : LSI
NVDATA Product ID : SAS9205-8e
BIOS Version : 07.33.00.00
UEFI BSD Version : 07.24.01.00
FCODE Version : N/A
Board Name : SAS9205-8e
Board Assembly : H3-25360-04H
Board Tracer Number : SP24610951
I tried to update it to version 20.00.02.00 and got a kernel panic.
I repeated the update and got the panic again.
First I erased the flash on my HBA:
./sas2flash -o -e 6 -c 1
and then installed the new firmware:
./sas2flash -o -f 9205-8e.bin -b mptsas2.rom -c 1
The last message from sas2flash:
http://i.imgur.com/8k165Up.png
fmdump -Vp -u 014fbc31-4173-6e4a-cb49-d7304cadd239
TIME UUID SUNW-MSG-ID
Mar 09 2015 14:19:09.419859000 014fbc31-4173-6e4a-cb49-d7304cadd239 SUNOS-8000-KL
TIME CLASS ENA
Mar 09 14:19:09.4133 ireport.os.sunos.panic.dump_available 0x0000000000000000
Mar 09 14:17:44.8899 ireport.os.sunos.panic.dump_pending_on_device 0x0000000000000000
nvlist version: 0
  version = 0x0
  class = list.suspect
  uuid = 014fbc31-4173-6e4a-cb49-d7304cadd239
  code = SUNOS-8000-KL
  diag-time = 1425907149 413946
  de = fmd:///module/software-diagnosis
  fault-list-sz = 0x1
  fault-list = (array of embedded nvlists)
  (start fault-list[0])
  nvlist version: 0
    version = 0x0
    class = defect.sunos.kernel.panic
    certainty = 0x64
    asru = sw:///:path=/var/crash/unknown/.014fbc31-4173-6e4a-cb49-d7304cadd239
    resource = sw:///:path=/var/crash/unknown/.014fbc31-4173-6e4a-cb49-d7304cadd239
    savecore-succcess = 1
    dump-dir = /var/crash/unknown
    dump-files = vmdump.0
    os-instance-uuid = 014fbc31-4173-6e4a-cb49-d7304cadd239
    panicstr = vmem_hash_delete(ffffff1aa3456000, 1, 8): bad free
    panicstack = genunix:vmem_hash_delete+9b () | genunix:vmem_xfree+4b () | genunix:vmem_free+23 () | genunix:rmfree+6e () | mpt_sas:mptsas_pkt_destroy_extern+cf () | mpt_sas:mptsas_scsi_destroy_pkt+75 () | scsi:scsi_destroy_pkt+1a () | ses:ses_callback+c1 () | mpt_sas:mptsas_pkt_comp+2b () | mpt_sas:mptsas_doneq_empty+ae () | mpt_sas:mptsas_intr+177 () | apix:apix_dispatch_by_vector+8c () | apix:apix_dispatch_lowlevel+25 () | unix:switch_sp_and_call+13 () | apix:apix_do_interrupt+387 () | unix:cmnint+ba () | unix:i86_mwait+d () | unix:cpu_idle_mwait+109 () | unix:idle+a7 () | unix:thread_start+8 () |
    crashtime = 1425904511
    panic-time = Mon Mar 9 13:35:11 2015 CET
  (end fault-list[0])
  fault-status = 0x1
  severity = Major
  __ttl = 0x1
  __tod = 0x54fd9dcd 0x19068a38
cat /etc/release
OmniOS v11 r151012
uname -v
omnios-10b9c79
MB: Supermicro X9DRE-TF+/X9DR7-TF+
The file crash.0 is attached
Kind regards,
Alexander
Files
Related issues
Updated by Simon K over 8 years ago
Same panic stack trace here, but with a faulty disk and a running zpool resilver after reboot:
Jul 22 19:01:07 yagg genunix: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,d138@3/pci1000,3010@0 (mpt_sas0):
Jul 22 19:01:07 yagg Target 30 reset for command timeout recovery failed!
Jul 22 19:02:32 yagg genunix: [ID 107833 kern.notice] /pci@0,0/pci8086,d138@3/pci1000,3010@0 (mpt_sas0):
Jul 22 19:02:32 yagg Timeout of 60 seconds expired with 1 commands on target 34 lun 0.
Jul 22 19:02:32 yagg genunix: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,d138@3/pci1000,3010@0 (mpt_sas0):
Jul 22 19:02:32 yagg Disconnected command timeout for target 34 w5000c5002d7aebeb, enclosure 2
Jul 22 19:02:33 yagg genunix: [ID 243001 kern.warning] WARNING: /pci@0,0/pci8086,d138@3/pci1000,3010@0 (mpt_sas0):
Jul 22 19:02:33 yagg mptsas_handle_event_sync: event 0xf, IOCStatus=0x8000, IOCLogInfo=0x31170000
Jul 22 19:02:36 yagg genunix: [ID 365881 kern.info] /pci@0,0/pci8086,d138@3/pci1000,3010@0 (mpt_sas0):
Jul 22 19:02:36 yagg Log info 0x31140000 received for target 34 w5000c5002d7aebeb.
Jul 22 19:02:36 yagg scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
Jul 22 19:02:36 yagg genunix: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,d138@3/pci1000,3010@0 (mpt_sas0):
Jul 22 19:02:36 yagg mptsas_check_task_mgt: Task 0x3 failed. IOCStatus=0x4a IOCLogInfo=0x0 target=34
Jul 22 19:02:36 yagg genunix: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,d138@3/pci1000,3010@0 (mpt_sas0):
Jul 22 19:02:36 yagg mptsas_ioc_task_management failed try to reset ioc to recovery!
Jul 22 19:02:38 yagg genunix: [ID 365881 kern.info] /pci@0,0/pci8086,d138@3/pci1000,3010@0 (mpt_sas0):
Jul 22 19:02:38 yagg MPT Firmware version v13.0.1.0 (SAS2004)
Jul 22 19:02:38 yagg genunix: [ID 365881 kern.info] /pci@0,0/pci8086,d138@3/pci1000,3010@0 (mpt_sas0):
Jul 22 19:02:38 yagg mpt_sas0 MPI Version 0x200
Jul 22 19:02:38 yagg genunix: [ID 365881 kern.info] /pci@0,0/pci8086,d138@3/pci1000,3010@0 (mpt_sas0):
Jul 22 19:02:38 yagg mpt0: IOC Operational.
Jul 22 19:04:44 yagg genunix: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,d138@3/pci1000,3010@0 (mpt_sas0):
Jul 22 19:04:44 yagg Target 34 reset for command timeout recovery failed!
Jul 22 19:04:47 yagg genunix: [ID 243001 kern.warning] WARNING: /pci@0,0/pci8086,d138@3/pci1000,3010@0 (mpt_sas0):
Jul 22 19:04:47 yagg mptsas_handle_event_sync: event 0xf, IOCStatus=0x8000, IOCLogInfo=0x31170000
Jul 22 19:04:47 yagg genunix: [ID 243001 kern.warning] WARNING: /pci@0,0/pci8086,d138@3/pci1000,3010@0 (mpt_sas0):
Jul 22 19:04:47 yagg mptsas_handle_event: IOCStatus=0x8000, IOCLogInfo=0x31170000
Jul 22 19:04:47 yagg genunix: [ID 365881 kern.info] /pci@0,0/pci8086,d138@3/pci1000,3010@0 (mpt_sas0):
Jul 22 19:04:47 yagg Log info 0x31170000 received for target 34 p0.
Jul 22 19:04:47 yagg scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc
Jul 22 19:05:00 yagg fmd: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-FD, TYPE: Fault, VER: 1, SEVERITY: Major
Jul 22 19:05:00 yagg EVENT-TIME: Wed Jul 22 19:05:00 MEST 2015
Jul 22 19:05:00 yagg PLATFORM: X8SIL, CSN: 0123456789, HOSTNAME: yagg
Jul 22 19:05:00 yagg SOURCE: zfs-diagnosis, REV: 1.0
Jul 22 19:05:00 yagg EVENT-ID: 2de13ccc-234e-4a1d-cefd-c3f0c52abddd
Jul 22 19:05:00 yagg DESC: The number of I/O errors associated with a ZFS device exceeded
Jul 22 19:05:00 yagg acceptable levels. Refer to http://illumos.org/msg/ZFS-8000-FD for more information.
Jul 22 19:05:00 yagg AUTO-RESPONSE: The device has been offlined and marked as faulted. An attempt
Jul 22 19:05:00 yagg will be made to activate a hot spare if available.
Jul 22 19:05:00 yagg IMPACT: Fault tolerance of the pool may be compromised.
Jul 22 19:05:00 yagg REC-ACTION: Run 'zpool status -x' and replace the bad device.
Jul 22 19:16:27 yagg genunix: [ID 107833 kern.notice] /pci@0,0/pci8086,d138@3/pci1000,3010@0 (mpt_sas0):
Jul 22 19:16:27 yagg Timeout of 60 seconds expired with 1 commands on target 34 lun 0.
Jul 22 19:16:27 yagg genunix: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,d138@3/pci1000,3010@0 (mpt_sas0):
Jul 22 19:16:27 yagg Disconnected command timeout for target 34 w5000c5002d7aebeb, enclosure 2
Jul 22 19:16:27 yagg genunix: [ID 243001 kern.warning] WARNING: /pci@0,0/pci8086,d138@3/pci1000,3010@0 (mpt_sas0):
Jul 22 19:16:27 yagg mptsas_handle_event_sync: event 0xf, IOCStatus=0x8000, IOCLogInfo=0x31170000
Jul 22 19:16:31 yagg genunix: [ID 365881 kern.info] /pci@0,0/pci8086,d138@3/pci1000,3010@0 (mpt_sas0):
Jul 22 19:16:31 yagg Log info 0x31140000 received for target 34 w5000c5002d7aebeb.
Jul 22 19:16:31 yagg scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
Jul 22 19:16:31 yagg genunix: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,d138@3/pci1000,3010@0 (mpt_sas0):
Jul 22 19:16:31 yagg mptsas_check_task_mgt: Task 0x3 failed. IOCStatus=0x4a IOCLogInfo=0x0 target=34
Jul 22 19:16:31 yagg genunix: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,d138@3/pci1000,3010@0 (mpt_sas0):
Jul 22 19:16:31 yagg mptsas_ioc_task_management failed try to reset ioc to recovery!
Jul 22 19:16:32 yagg genunix: [ID 365881 kern.info] /pci@0,0/pci8086,d138@3/pci1000,3010@0 (mpt_sas0):
Jul 22 19:16:32 yagg MPT Firmware version v13.0.1.0 (SAS2004)
Jul 22 19:16:32 yagg genunix: [ID 365881 kern.info] /pci@0,0/pci8086,d138@3/pci1000,3010@0 (mpt_sas0):
Jul 22 19:16:32 yagg mpt_sas0 MPI Version 0x200
Jul 22 19:16:32 yagg genunix: [ID 723447 kern.warning] WARNING: vmem_destroy('rmap'): leaked 8 bytes
Jul 22 19:16:32 yagg genunix: [ID 365881 kern.info] /pci@0,0/pci8086,d138@3/pci1000,3010@0 (mpt_sas0):
Jul 22 19:16:32 yagg mpt0: IOC Operational.
Jul 22 19:18:39 yagg genunix: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,d138@3/pci1000,3010@0 (mpt_sas0):
Jul 22 19:18:39 yagg unix: [ID 836849 kern.notice]
Jul 22 19:18:39 yagg ^Mpanic[cpu3]/thread=ffffff001ec4bc40:
Jul 22 19:18:39 yagg genunix: [ID 697804 kern.notice] vmem_hash_delete(ffffff04ec874000, 1, 8): bad free
Jul 22 19:18:39 yagg unix: [ID 100000 kern.notice]
Jul 22 19:18:39 yagg genunix: [ID 655072 kern.notice] ffffff001ec4b900 genunix:vmem_hash_delete+9b ()
Jul 22 19:18:39 yagg genunix: [ID 655072 kern.notice] ffffff001ec4b960 genunix:vmem_xfree+4b ()
Jul 22 19:18:39 yagg genunix: [ID 655072 kern.notice] ffffff001ec4b990 genunix:vmem_free+23 ()
Jul 22 19:18:39 yagg genunix: [ID 655072 kern.notice] ffffff001ec4b9e0 genunix:rmfree+6e ()
Jul 22 19:18:39 yagg genunix: [ID 655072 kern.notice] ffffff001ec4ba10 mpt_sas:mptsas_pkt_destroy_extern+cf ()
Jul 22 19:18:39 yagg genunix: [ID 655072 kern.notice] ffffff001ec4ba40 mpt_sas:mptsas_scsi_destroy_pkt+75 ()
Jul 22 19:18:39 yagg genunix: [ID 655072 kern.notice] ffffff001ec4ba60 scsi:scsi_destroy_pkt+1a ()
Jul 22 19:18:39 yagg genunix: [ID 655072 kern.notice] ffffff001ec4bab0 ses:ses_callback+c1 ()
Jul 22 19:18:39 yagg genunix: [ID 655072 kern.notice] ffffff001ec4bae0 mpt_sas:mptsas_pkt_comp+2b ()
Jul 22 19:18:39 yagg genunix: [ID 655072 kern.notice] ffffff001ec4bb30 mpt_sas:mptsas_doneq_empty+ae ()
Jul 22 19:18:40 yagg genunix: [ID 655072 kern.notice] ffffff001ec4bb70 mpt_sas:mptsas_intr+177 ()
Jul 22 19:18:40 yagg genunix: [ID 655072 kern.notice] ffffff001ec4bbe0 unix:av_dispatch_autovect+91 ()
Jul 22 19:18:40 yagg genunix: [ID 655072 kern.notice] ffffff001ec4bc20 unix:dispatch_hardint+36 ()
Jul 22 19:18:40 yagg genunix: [ID 655072 kern.notice] ffffff001ec15a60 unix:switch_sp_and_call+13 ()
Jul 22 19:18:40 yagg genunix: [ID 655072 kern.notice] ffffff001ec15ac0 unix:do_interrupt+a8 ()
Jul 22 19:18:40 yagg genunix: [ID 655072 kern.notice] ffffff001ec15ad0 unix:cmnint+ba ()
Jul 22 19:18:40 yagg genunix: [ID 655072 kern.notice] ffffff001ec15bc0 unix:i86_mwait+d ()
Jul 22 19:18:40 yagg genunix: [ID 655072 kern.notice] ffffff001ec15c00 unix:cpu_idle_mwait+109 ()
Jul 22 19:18:40 yagg genunix: [ID 655072 kern.notice] ffffff001ec15c20 unix:idle+a7 ()
Jul 22 19:18:40 yagg genunix: [ID 655072 kern.notice] ffffff001ec15c30 unix:thread_start+8 ()
Updated by Marcel Telka about 7 years ago
Another one (note the "leaked 8 bytes" warning just a few seconds before the panic):
> ::status
debugging crash dump vmcore.0.glatra.20161105-1151 (64-bit) from glatra
operating system: 5.11 titanic_30 (i86pc)
image uuid: bc701646-ea61-c47b-f73c-83a3d72e1176
panic message: vmem_hash_delete(ffffff066d020000, 1, 8): bad free
dump content: kernel pages only
> ::stack
vpanic()
vmem_hash_delete+0x9b(ffffff066d020000, 1, 8)
vmem_xfree+0x4b(ffffff066d020000, 1, 8)
vmem_free+0x23(ffffff066d020000, 1, 8)
rmfree+0x6e(ffffff066d020000, 8, 1)
mptsas_pkt_destroy_extern+0xcf(ffffff066bbdd000, ffffffecdf782da8)
mptsas_scsi_destroy_pkt+0x75(ffffffecdf782f38, ffffffecdf782f30)
scsi_destroy_pkt+0x1a(ffffffecdf782f30)
sd_destroypkt_for_uscsi+0xce(ffffff0a57811e80)
sd_return_command+0x10e(ffffff066c1d7000, ffffff0a57811e80)
sd_sense_key_recoverable_error+0x104(ffffff066c1d7000, ffffff44d49654e9, ffffff0a57811e80, ffffff44d4965480, ffffffecdf782f30)
sd_decode_sense+0xcc(ffffff066c1d7000, ffffff0a57811e80, ffffff44d4965480, ffffffecdf782f30)
sd_handle_auto_request_sense+0xcd(ffffff066c1d7000, ffffff0a57811e80, ffffff44d4965480, ffffffecdf782f30)
sdintr+0x391(ffffffecdf782f30)
mptsas_pkt_comp+0x2b(ffffffecdf782f30, ffffffecdf782da8)
mptsas_doneq_empty+0xae(ffffff066bbdd000)
mptsas_intr+0x177(ffffff066bbdd000, 0)
av_dispatch_autovect+0x91(1a)
dispatch_hardint+0x36(1a, 0)
switch_sp_and_call+0x13()
do_interrupt+0xa8(ffffff002da5cf10, 816adb0)
_interrupt+0xba()
> ::msgbuf -v ! tail -n 58|head -n 40
2016 Nov 5 11:20:02 ffffff48f6e38bf0 /pci@0,0/pci8086,d138@3/pci1000,3010@0 (mpt_sas0): Log info 0x31140000 received for target 19 w5000cca234dab8a5.
  scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
2016 Nov 5 11:20:02 ffffff0a52ded2f0 /pci@0,0/pci8086,d138@3/pci1000,3010@0 (mpt_sas0): Log info 0x31140000 received for target 19 w5000cca234dab8a5.
  scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
2016 Nov 5 11:20:02 ffffff495d294cf0 /pci@0,0/pci8086,d138@3/pci1000,3010@0 (mpt_sas0): Log info 0x31140000 received for target 19 w5000cca234dab8a5.
  scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
2016 Nov 5 11:21:31 ffffff44cc002130 WARNING: /pci@0,0/pci8086,d138@3/pci1000,3010@0 (mpt_sas0): passthrough command timeout
2016 Nov 5 11:21:33 ffffff06778ed0b0 /pci@0,0/pci8086,d138@3/pci1000,3010@0 (mpt_sas0): MPT Firmware version v19.0.0.0 (SAS2004)
2016 Nov 5 11:21:33 ffffff6f09a78d70 /pci@0,0/pci8086,d138@3/pci1000,3010@0 (mpt_sas0): mpt_sas0 MPI Version 0x200
2016 Nov 5 11:21:33 ffffff497b4aee30 /pci@0,0/pci8086,d138@3/pci1000,3010@0 (mpt_sas0): mpt0: IOC Operational.
2016 Nov 5 11:23:20 ffffffed0dacaef0 WARNING: /pci@0,0/pci8086,d138@3/pci1000,3010@0 (mpt_sas0): passthrough command timeout
2016 Nov 5 11:23:21 ffffff4948099170 /pci@0,0/pci8086,d138@3/pci1000,3010@0 (mpt_sas0): MPT Firmware version v19.0.0.0 (SAS2004)
2016 Nov 5 11:23:21 ffffff351a16ee70 /pci@0,0/pci8086,d138@3/pci1000,3010@0 (mpt_sas0): mpt_sas0 MPI Version 0x200
2016 Nov 5 11:23:21 ffffff0a36666770 WARNING: vmem_destroy('rmap'): leaked 8 bytes
2016 Nov 5 11:23:21 ffffff0ae50e6ab0 /pci@0,0/pci8086,d138@3/pci1000,3010@0 (mpt_sas0): mpt0: IOC Operational.
2016 Nov 5 11:23:35 ffffff0af36ae930 panic[cpu1]/thread=ffffff001eb88c40:
2016 Nov 5 11:23:35 ffffff066a517770 vmem_hash_delete(ffffff066d020000, 1, 8): bad free
2016 Nov 5 11:23:35 ffffff5368328930
2016 Nov 5 11:23:35 ffffff06df6444b0 ffffff001eb886f0 genunix:vmem_hash_delete+9b ()
2016 Nov 5 11:23:36 ffffff0676d39ef0 ffffff001eb88750 genunix:vmem_xfree+4b ()
2016 Nov 5 11:23:36 ffffffb5b1a0dc70 ffffff001eb88780 genunix:vmem_free+23 ()
2016 Nov 5 11:23:36 ffffff06d1c854b0 ffffff001eb887d0 genunix:rmfree+6e ()
2016 Nov 5 11:23:36 ffffffecdc008870 ffffff001eb88800 mpt_sas:mptsas_pkt_destroy_extern+cf ()
2016 Nov 5 11:23:36 ffffffe56367e9f0 ffffff001eb88830 mpt_sas:mptsas_scsi_destroy_pkt+75 ()
2016 Nov 5 11:23:36 ffffffecdf193330 ffffff001eb88850 scsi:scsi_destroy_pkt+1a ()
2016 Nov 5 11:23:36 ffffff0a438ac2f0 ffffff001eb88890 sd:sd_destroypkt_for_uscsi+ce ()
2016 Nov 5 11:23:36 ffffffde1036aeb0 ffffff001eb888f0 sd:sd_return_command+10e ()
>
Updated by Marcel Telka about 7 years ago
- Subject changed from My System Crashed during the LSI SAS9205-8e firmware update to panic in mpt_sas: vmem_hash_delete(ffffff1aa3456000, 1, 8): bad free
- Category set to driver - device drivers
Updated by Marcel Telka about 7 years ago
- Related to Feature #5016: improve mpt_sas auto request sense added
Updated by Marcel Telka about 7 years ago
- Priority changed from Normal to High
Happened again today:
> ::status
debugging crash dump vmcore.0 (64-bit) from efrem
operating system: 5.11 titanic_29 (i86pc)
image uuid: aa68d3bc-cd31-429d-efa4-8c0bb262f3a7
panic message: vmem_hash_delete(ffffff0676299000, 1, 8): bad free
dump content: kernel pages only
> ::stack
vpanic()
vmem_hash_delete+0x9b(ffffff0676299000, 1, 8)
vmem_xfree+0x4b(ffffff0676299000, 1, 8)
vmem_free+0x23(ffffff0676299000, 1, 8)
rmfree+0x6e(ffffff0676299000, 8, 1)
mptsas_pkt_destroy_extern+0xcf(ffffff0674ac9000, ffffff08be6ec8a8)
mptsas_scsi_destroy_pkt+0x75(ffffff08be6eca38, ffffff08be6eca30)
scsi_destroy_pkt+0x1a(ffffff08be6eca30)
ses_callback+0xc1(ffffff08be6eca30)
mptsas_pkt_comp+0x2b(ffffff08be6eca30, ffffff08be6ec8a8)
mptsas_doneq_thread+0x7f(ffffff0675c8d830)
thread_start+8()
>
Updated by Igor Kozhukhov about 7 years ago
What is the base system?
How old is mpt_sas?
It would be better to know the latest illumos revision for this crash.
Updated by Marcel Telka about 7 years ago
Igor Kozhukhov wrote:
What is the base system?
How old is mpt_sas?
It would be better to know the latest illumos revision for this crash.
This is happening on systems whose mpt_sas driver sources are identical to those in the latest illumos, with one exception: the fix for 6299 is not backported there.
Updated by Igor Kozhukhov about 7 years ago
Marcel Telka wrote:
Igor Kozhukhov wrote:
What is the base system?
How old is mpt_sas?
It would be better to know the latest illumos revision for this crash.
This is happening on systems whose mpt_sas driver sources are identical to those in the latest illumos, with one exception: the fix for 6299 is not backported there.
I would be interested in a test result based on the latest SmartOS; they have an additional patch that is not yet upstreamed to illumos.
Updated by Dan Fields almost 7 years ago
I can take this one - looks like it was introduced by:
commit 60bf7f6fe2a9d65f5b54de946358dfd8ff1e46e0
Author: Andy Giles <illumos@ang.homedns.org>
Date: Wed Jun 18 16:56:29 2014 +0200
5016 improve mpt_sas auto request sense
Reviewed by: Hans Rosenfeld <hans.rosenfeld@nexenta.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Gordon Ross <gwr@nexenta.com>
Updated by Dan Fields almost 7 years ago
- Assignee set to Dan Fields
- % Done changed from 0 to 40
- Estimated time set to 36.00 h
The m_req_sense buffer is allocated with kmem_alloc() and set up for DMA.
The extra sense buffer portion of the DMA memory, marked by m_extreq_sense,
is managed by rmalloc() in quanta of 0x20-byte blocks. The idea
is that rmalloc() will find and allocate contiguous 0x20-byte blocks
to accommodate the extra sense size (the maximum extra sense is 252 bytes,
which rounds up to 8 contiguous blocks).
 *   +-----------------------------------+ m_req_sense
 *   |               0x20                |
 *   +-----------------------------------+
 *   |                 .                 |
 *   |                 .                 |
 *   |       m_max_requests x 0x20       |
 *   |                 .                 |
 *   |                 .                 |
 *   +-----------------------------------+m_extreq_sense
 *   |               0x20                |\
 *   +-----------------------------------+ \
 *   |               0x20                | |
 *   +-----------------------------------+ |
 *   |               0x20                | |
 *   +-----------------------------------+ |-- managed by rmalloc()
 *   |                                   | |
 *   |                                   | |
 *   |                                   | /
 *   +-----------------------------------+/
I'm not sure how this ever worked properly. The memory at 'm_extreq_sense', which is preallocated,
was never set up in the rmalloc() system; instead the address '1' was given. Then, when the call to rmalloc() was made,
the result was incremented by one, which I think was meant to avoid using the 'zero' position, but that might have been
a misapplication of how RMID=0 is reserved for system messages. That does not apply to the extra sense buffers, which are a
pool to be allocated as needed.
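As a rough illustration of the block arithmetic described above, the rounding can be sketched in a few lines of Python. This is not driver code: the names SENSE_QUANTUM, MAX_EXTRA_SENSE and sense_blocks are made up for the example, and only the 0x20 block size and the 252 byte maximum are taken from the description.

```python
SENSE_QUANTUM = 0x20     # block size of the extra sense area (from the description)
MAX_EXTRA_SENSE = 252    # maximum extra sense length in bytes (from the description)


def sense_blocks(sense_len, quantum=SENSE_QUANTUM):
    """Return how many contiguous quantum-sized blocks cover sense_len bytes."""
    if sense_len <= 0:
        raise ValueError("sense length must be positive")
    # Round up to the next whole block.
    return (sense_len + quantum - 1) // quantum
```

sense_blocks(MAX_EXTRA_SENSE) evaluates to 8, which appears to match the size argument in the rmfree(map, 8, 1) frames of the panic stacks, where the address argument is 1.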
Updated by Dan Fields almost 7 years ago
- Assignee changed from Dan Fields to Marcel Telka
- % Done changed from 40 to 70
Marcel has a fix.
Updated by Marcel Telka almost 7 years ago
- File mptreset.c added
Steps to reproduce
Run the following two scripts concurrently:
# while true ; do sg_raw -r 1k /dev/rdsk/DISK 12 00 00 00 60 00 ; done &
# while true ; do ./mptreset DEVCTL ; sleep 1 ; done &
Replace DISK with the real disk and DEVCTL with the path suggested by mptreset when run without parameters.
With the above the system should panic (almost) immediately.
Updated by Marcel Telka almost 7 years ago
- Status changed from New to In Progress
Updated by Marcel Telka almost 7 years ago
Root cause
The panic happens when the mptsas_restart_ioc() -> mptsas_init_chip() -> mptsas_alloc_sense_bufs() code path is executed while there is an outstanding command waiting for a reply. In mptsas_alloc_sense_bufs() we rmfreemap()/rmallocmap() the resource map, including the space used by the pending command (this is where the "leaked 8 bytes" warning comes from). Once the pending command receives its reply, the space used by the command is freed and we hit the panic with the "bad free" string.
How can such a case ever happen?
  2  -> mptsas_scsi_init_pkt
  2    -> mptsas_pkt_alloc_extern
  2      -> mptsas_cmdarqsize
  2        -> rmalloc
              mpt_sas`mptsas_cmdarqsize+0x88
              mpt_sas`mptsas_pkt_alloc_extern+0xa3
              mpt_sas`mptsas_scsi_init_pkt+0x5a5
              scsi`scsi_init_pkt+0x63
              sd`sd_initpkt_for_uscsi+0x8f
              sd`sd_start_cmds+0xcc
              sd`sd_core_iostart+0x90
              sd`sd_uscsi_strategy+0xfa
              genunix`default_physio+0x2db
              genunix`physio+0x25
              scsi`scsi_uscsi_handle_cmd+0x29d
              sd`sd_ssc_send+0x136
              sd`sdioctl+0xa96
              genunix`cdev_ioctl+0x39
              specfs`spec_ioctl+0x60
              genunix`fop_ioctl+0x55
              genunix`ioctl+0x9b
              unix`_sys_sysenter_post_swapgs+0x149
  2        <- rmalloc
  2      <- mptsas_cmdarqsize
  2    <- mptsas_pkt_alloc_extern
  2  <- mptsas_scsi_init_pkt
The scsi_pkt/mptsas_cmd is allocated without any protection. In mptsas_cmdarqsize() both the DMA memory and space from the resource map are allocated/assigned to the mptsas_cmd. Because the allocation is not protected, it is possible (and very likely what is happening) that the reset comes after the mptsas_cmd allocation but before the command is executed (via mptsas_scsi_start()).
During the reset both the DMA memory and the resource map are freed and then re-allocated (in mptsas_alloc_sense_bufs()), but the mptsas_cmd allocated above still references the now-freed DMA memory and the now-freed space from the resource map. Once the reply to the command arrives, it is stored in the freed DMA memory area. Later, when the command (mptsas_cmd) is freed, the space from the resource map is freed too, and this causes the panic because the resource map was already freed some time ago.
This is very likely causing silent memory corruption too (in addition to the panic we see). Currently this is probably not very serious, because if the corruption happens we will panic anyway.
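The sequence described above can be replayed with a toy user-space model. The sketch below is only an illustration in Python, not the kernel's vmem/rmalloc implementation: ToyResourceMap, BadFree and simulate_reset_race are hypothetical names, the map counts blocks rather than bytes, and the only behaviour borrowed from the report is that addresses start at 1, that a destroyed map reports what was still allocated, and that freeing a segment the map does not know about is a "bad free".

```python
class BadFree(Exception):
    """Raised when a segment is freed that the map does not know about."""


class ToyResourceMap:
    """Toy stand-in for a resource map; block addresses start at 1."""

    def __init__(self, nblocks):
        self.nblocks = nblocks
        self.allocated = {}  # start address -> size in blocks

    def alloc(self, size):
        """First-fit allocation; returns 0 on failure."""
        addr = 1
        for start in sorted(self.allocated):
            if start - addr >= size:
                break  # the gap before this segment is large enough
            addr = start + self.allocated[start]
        if addr + size - 1 > self.nblocks:
            return 0
        self.allocated[addr] = size
        return addr

    def free(self, size, addr):
        if self.allocated.get(addr) != size:
            raise BadFree(f"bad free: addr={addr} size={size}")
        del self.allocated[addr]

    def leaked(self):
        """Blocks still allocated, i.e. what a destroy would report as leaked."""
        return sum(self.allocated.values())


def simulate_reset_race():
    rmap = ToyResourceMap(64)
    addr = rmap.alloc(8)        # packet allocated, command not yet started
    leaked = rmap.leaked()      # still held when the reset tears the map down
    rmap = ToyResourceMap(64)   # reset: old map destroyed, fresh map created
    try:
        rmap.free(8, addr)      # packet destroyed after the reset
    except BadFree as err:
        return leaked, str(err)
    return leaked, None
```

In this model simulate_reset_race() reports 8 leaked blocks at reset time followed by a bad free of address 1 with size 8, mirroring the order of the "leaked 8 bytes" warning and the panic in the logs.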
Fix description
The fix counts the extreq_sense allocations and makes sure the count goes back to zero during the HBA reset. This ensures that, when mptsas_alloc_sense_bufs() is called, no allocated packet has a DMA buffer for sense data assigned to it and no space is allocated from the resource map.
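The counting idea can be sketched in user space as follows. This is an assumption-laden illustration in Python and not the code from the commit: SenseAllocCounter and its methods are hypothetical names, and waiting on a condition variable until the count drops to zero is just one plausible way to "make sure it goes back to zero"; the driver may do this differently.

```python
import threading


class SenseAllocCounter:
    """Counts outstanding extra sense allocations (hypothetical model)."""

    def __init__(self):
        self._cv = threading.Condition()
        self._count = 0
        self.generation = 0  # bumped each time the buffers are rebuilt

    def acquire(self):
        """Record one allocation from the extra sense area."""
        with self._cv:
            self._count += 1

    def release(self):
        """Record that one allocation was returned."""
        with self._cv:
            if self._count == 0:
                raise RuntimeError("release without matching acquire")
            self._count -= 1
            if self._count == 0:
                self._cv.notify_all()

    def reset(self, timeout=None):
        """Rebuild the buffers only once nothing is outstanding.

        Returns True if the rebuild happened, False if the wait timed out.
        """
        with self._cv:
            if not self._cv.wait_for(lambda: self._count == 0, timeout):
                return False
            self.generation += 1  # stands in for freeing and re-allocating
            return True
```

With one allocation outstanding, reset(timeout=...) times out and leaves generation unchanged; after the matching release() the same call succeeds. The timeout exists only to keep the example from blocking forever.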
Updated by Marcel Telka almost 7 years ago
- Related to Bug #7813: mpt_sas does not like concurrent HBA resets added
Updated by Marcel Telka almost 7 years ago
Updated by Marcel Telka almost 7 years ago
- Status changed from In Progress to Pending RTI
Updated by Electric Monk almost 7 years ago
- Status changed from Pending RTI to Closed
- % Done changed from 70 to 100
git commit 2482ae1b96a558eec551575934d5f06c87b807af
commit 2482ae1b96a558eec551575934d5f06c87b807af
Author: Marcel Telka <marcel@telka.sk>
Date: 2017-02-02T12:31:12.000Z
5698 panic in mpt_sas: vmem_hash_delete(ffffff1aa3456000, 1, 8): bad free
Reviewed by: Dan Fields <danfields@fastmail.us>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Updated by Marcel Telka over 3 years ago
- Related to Bug #6182: mpt_sas panic added