Bug #5538
closed
Kernel panic on SAS disk failure
0%
Description
Our storage server suddenly reboots after a disk fails and is removed from the zpool. We have now encountered these crashes multiple times.
I assume this problem happens in the mpt_sas driver due to a race condition.
A core dump was acquired, and the system crashed with the following panic:
panic[cpu1]/thread=ffffff001ea77c40: BAD TRAP: type=d (#gp General protection) rp=ffffff001ea77930 addr=ffffff0506a9a036
sched: #gp General protection
addr=0xffffff0506a9a036
pid=0, pc=0xfffffffff7ab1341, sp=0xffffff001ea77a20, eflags=0x10286
cr0: 8005003b<pg,wp,ne,et,ts,mp,pe> cr4: 406f8<osxsav,xmme,fxsr,pge,mce,pae,pse,de>
cr2: fa7d0fec cr3: 3c00000 cr8: 0
rdi: ffffff072da234e8 rsi: ffffff04e4f90600 rdx: f7aa572f
rcx: ffffff05064c49c0 r8: ffffff04dd0b5a78 r9: 0
rax: 75894810ec8348e5 rbx: ffffff05064c49b0 rbp: ffffff001ea77a80
r10: 1 r11: ffffff04e5764040 r12: ffffff0506a9a000
r13: 15f40508d3f322 r14: ffffff0506a9a038 r15: ffffff0506a9a036
fsb: 0 gsb: ffffff04e5764040 ds: 4b
es: 4b fs: 0 gs: 1c3
trp: d err: 0 rip: fffffffff7ab1341
cs: 30 rfl: 10286 rsp: ffffff001ea77a20
ss: 38
ffffff001ea77810 unix:real_mode_stop_cpu_stage2_end+9e73 ()
ffffff001ea77920 unix:trap+a30 ()
ffffff001ea77930 unix:cmntrap+e6 ()
ffffff001ea77a80 mpt_sas:mptsas_watchsubr+111 ()
ffffff001ea77ab0 mpt_sas:mptsas_watch+96 ()
ffffff001ea77b00 genunix:callout_list_expire+98 ()
ffffff001ea77b30 genunix:callout_expire+3b ()
ffffff001ea77b60 genunix:callout_execute+20 ()
ffffff001ea77c20 genunix:taskq_thread+2d0 ()
ffffff001ea77c30 unix:thread_start+8 ()
In the (not timestamped) message log, the following is seen before the panic:
/pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2): Timeout of 5 seconds expired with 21 commands on target 11 lun 0.
WARNING: /pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2): Disconnected command timeout for target 11 w50025385503bdc88.
/pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2): Log info 0x31130000 received for target 11 w50025385503bdc88. scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
/pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2): Log info 0x31130000 received for target 11 w50025385503bdc88. scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
/pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2): Log info 0x31140000 received for target 11 w50025385503bdc88. scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
/pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2): Log info 0x31140000 received for target 11 w50025385503bdc88. scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
/pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2): Log info 0x31140000 received for target 11 w50025385503bdc88. scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
/pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2): Log info 0x31140000 received for target 11 w50025385503bdc88. scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
/pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2): Log info 0x31140000 received for target 11 w50025385503bdc88. scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
/pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2): Log info 0x31140000 received for target 11 w50025385503bdc88. scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
/pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2): Log info 0x31140000 received for target 11 w50025385503bdc88. scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
/pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2): Log info 0x31140000 received for target 11 w50025385503bdc88. scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
/pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2): Log info 0x31140000 received for target 11 w50025385503bdc88. scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
/pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2): Log info 0x31140000 received for target 11 w50025385503bdc88. scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
/pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2): Log info 0x31140000 received for target 11 w50025385503bdc88. scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
/pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2): Log info 0x31140000 received for target 11 w50025385503bdc88. scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
/pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2): Log info 0x31140000 received for target 11 w50025385503bdc88. scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
/pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2): Log info 0x31140000 received for target 11 w50025385503bdc88. scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
/pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2): Log info 0x31140000 received for target 11 w50025385503bdc88. scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
/pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2): Log info 0x31140000 received for target 11 w50025385503bdc88. scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
/pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2): Log info 0x31140000 received for target 11 w50025385503bdc88. scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
/pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2): Log info 0x31140000 received for target 11 w50025385503bdc88. scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
/pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2): Log info 0x31140000 received for target 11 w50025385503bdc88. scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
WARNING: /scsi_vhci (scsi_vhci0): /scsi_vhci/disk@g50025385503bdc88 (sd7): Command Timeout on path mpt_sas9/disk@w50025385503bdc88,0
/pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2): Timeout of 5 seconds expired with 1 commands on target 11 lun 0.
WARNING: /pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2): Disconnected command timeout for target 11 w50025385503bdc88.
/pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2): Log info 0x31130000 received for target 11 w50025385503bdc88. scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
WARNING: /pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2): mptsas_handle_event_sync: IOCStatus=0x8000, IOCLogInfo=0x31111000
WARNING: /pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2): mptsas_handle_event: IOCStatus=0x8000, IOCLogInfo=0x31111000
/pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2): Timeout of 5 seconds expired with 4 commands on target 11 lun 0.
WARNING: /pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2): Disconnected command timeout for target 11 w50025385503bdc88.
/pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2): Log info 0x31111000 received for target 11 w50025385503bdc88. scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc
/pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2): Log info 0x31111000 received for target 11 w50025385503bdc88. scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc
/pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2): Log info 0x31111000 received for target 11 w50025385503bdc88. scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc
/pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2): Log info 0x31111000 received for target 11 w50025385503bdc88. scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc
WARNING: /pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2): mptsas_check_task_mgt: Task 0x3 failed. IOCStatus=0x4b IOCLogInfo=0x0 target=11
WARNING: /pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2): mptsas_ioc_task_management failed try to reset ioc to recovery!
/pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2): mpt2 Firmware version v18.0.0.0 (?)
/pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2): mpt2 MPI Version 0x200
/pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2): mpt2: IOC Operational.
WARNING: /pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2): mptsas_handle_event_sync: IOCStatus=0x8000, IOCLogInfo=0x31111000
/pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2): Log info 0x31111000 received for target 11 p0. scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc
WARNING: /pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2): mptsas request inquiry page 0x83 for target:b, lun:0 failed!
WARNING: /pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2): Target 11 reset for command timeout recovery failed!
I assume this is the way it happens:
The mpt_sas driver sets a watch with a timer in a thread (thread A).
In the meantime, a disk times out on its commands a few times (for reasons unknown) and ZFS detaches the disk from the system (to add the spare) (thread B).
The timer in thread A expires; the thread tries to read from the bus (from the disk that was detached) and takes a trap.
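To make the suspected interleaving concrete, here is a small Python model of it. It is only an illustration of the guess above, not the mpt_sas code: `Target`, `Controller`, `watch_unsafe` and `watch_safe` are made-up names, and the `RuntimeError` merely stands in for the #gp trap. The model runs the three steps in a fixed order instead of using real threads, so the outcome is deterministic.

```python
import threading


class Target:
    """Hypothetical stand-in for the per-disk state a driver keeps."""

    def __init__(self, target_id):
        self.target_id = target_id
        self.detached = False  # models memory that is no longer valid


class Controller:
    """Hypothetical stand-in for one HBA instance and its target table."""

    def __init__(self, target_ids):
        self.lock = threading.Lock()
        self.targets = {t: Target(t) for t in target_ids}

    def detach(self, target_id):
        # "Thread B": the failed disk is removed from the system.
        with self.lock:
            target = self.targets.pop(target_id)
            target.detached = True

    def watch_unsafe(self, stale_target):
        # "Thread A", racy variant: uses a reference taken before the detach.
        if stale_target.detached:
            # In a kernel this would be an access to invalid memory (a trap).
            raise RuntimeError("watch touched detached target %d"
                               % stale_target.target_id)
        return stale_target.target_id

    def watch_safe(self, target_id):
        # "Thread A", guarded variant: looks the target up again under the lock.
        with self.lock:
            target = self.targets.get(target_id)
            return None if target is None else target.target_id


def simulate():
    ctrl = Controller([10, 11])
    stale = ctrl.targets[11]   # step 1: timer armed while target 11 exists
    ctrl.detach(11)            # step 2: disk times out and is detached
    try:                       # step 3: timer fires
        ctrl.watch_unsafe(stale)
        unsafe_result = "ok"
    except RuntimeError:
        unsafe_result = "trap"
    return unsafe_result, ctrl.watch_safe(11)


if __name__ == "__main__":
    print(simulate())
```

In this model the unguarded watch hits the detached target, while the variant that re-checks the target table under the same lock as the detach simply skips it. Whether the real driver is missing such a check is exactly the open question of this report.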
Possibly related to issue #5537?
Related issues
Updated by Rich Ercolani over 7 years ago
cf. #5208 for other people reporting this issue.
Updated by Marcel Telka over 5 years ago
- Related to Bug #5208: Edge case in mpt_sas causing panic added
Updated by Marcel Telka over 5 years ago
- Category set to driver - device drivers
Updated by Marcel Telka 7 months ago
- Status changed from New to Duplicate
This is a duplicate of #5208.
Updated by Marcel Telka 7 months ago
- Is duplicate of Bug #5208: Edge case in mpt_sas causing panic added
Updated by Marcel Telka 7 months ago
- Related to deleted (Bug #5208: Edge case in mpt_sas causing panic)