Bug #5538

Kernel panic on SAS disk failure

Added by Frank Schreuder almost 5 years ago. Updated almost 3 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
driver - device drivers
Start date:
2015-01-14
Due date:
% Done:

0%

Estimated time:
Difficulty:
Medium
Tags:
needs-triage

Description

Our storage server suddenly reboots after a disk fails and is removed from the zpool. We have now encountered these crashes multiple times.
I suspect the problem is a race condition in the mpt_sas driver.
A core dump was acquired; the system crashed with the following panic:

panic[cpu1]/thread=ffffff001ea77c40:
BAD TRAP: type=d (#gp General protection) rp=ffffff001ea77930 addr=ffffff0506a9a036

sched:
#gp General protection
addr=0xffffff0506a9a036
pid=0, pc=0xfffffffff7ab1341, sp=0xffffff001ea77a20, eflags=0x10286
cr0: 8005003b<pg,wp,ne,et,ts,mp,pe> cr4: 406f8<osxsav,xmme,fxsr,pge,mce,pae,pse,de>
cr2: fa7d0fec
cr3: 3c00000
cr8: 0

rdi: ffffff072da234e8 rsi: ffffff04e4f90600 rdx:         f7aa572f
rcx: ffffff05064c49c0  r8: ffffff04dd0b5a78  r9:                0
rax: 75894810ec8348e5 rbx: ffffff05064c49b0 rbp: ffffff001ea77a80
r10:                1 r11: ffffff04e5764040 r12: ffffff0506a9a000
r13:   15f40508d3f322 r14: ffffff0506a9a038 r15: ffffff0506a9a036
fsb:                0 gsb: ffffff04e5764040  ds:               4b
es:               4b  fs:                0  gs:              1c3
trp:                d err:                0 rip: fffffffff7ab1341
cs:               30 rfl:            10286 rsp: ffffff001ea77a20
ss:               38

ffffff001ea77810 unix:real_mode_stop_cpu_stage2_end+9e73 ()
ffffff001ea77920 unix:trap+a30 ()
ffffff001ea77930 unix:cmntrap+e6 ()
ffffff001ea77a80 mpt_sas:mptsas_watchsubr+111 ()
ffffff001ea77ab0 mpt_sas:mptsas_watch+96 ()
ffffff001ea77b00 genunix:callout_list_expire+98 ()
ffffff001ea77b30 genunix:callout_expire+3b ()
ffffff001ea77b60 genunix:callout_execute+20 ()
ffffff001ea77c20 genunix:taskq_thread+2d0 ()
ffffff001ea77c30 unix:thread_start+8 ()
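
One cheap check on the register dump above: on x86-64, a #gp is commonly raised when an instruction dereferences a non-canonical address. The sketch below tests a few of the pointer-like register values for canonical form, assuming 48-bit virtual addressing. Whether the faulting instruction at mptsas_watchsubr+111 actually used the register it flags cannot be told from the output shown here; that would need a disassembly from the dump.

```python
def is_canonical(addr: int, va_bits: int = 48) -> bool:
    """Return True if addr is a canonical x86-64 virtual address.

    With 48-bit virtual addressing (an assumption about this machine),
    bits 63..47 must all be equal.
    """
    top = addr >> (va_bits - 1)              # bits 63..47 as an integer
    all_ones = (1 << (64 - va_bits + 1)) - 1
    return top == 0 or top == all_ones


# Register values copied from the panic output; only a pointer-like
# subset is listed here.
registers = {
    "rax": 0x75894810ec8348e5,
    "rbx": 0xffffff05064c49b0,
    "r12": 0xffffff0506a9a000,
    "r14": 0xffffff0506a9a038,
    "r15": 0xffffff0506a9a036,
}

non_canonical = [name for name, value in registers.items()
                 if not is_canonical(value)]
print(non_canonical)
```

If rax turns out to be the dereferenced register, a garbage value there would be consistent with a pointer read from memory that had already been freed or reused, but that remains a guess until the dump is examined.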

In the message log (which is not timestamped), the following appears before the panic:

/pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2):
Timeout of 5 seconds expired with 21 commands on target 11 lun 0.
WARNING: /pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2):
Disconnected command timeout for target 11 w50025385503bdc88.
/pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2):
Log info 0x31130000 received for target 11 w50025385503bdc88.
scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
/pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2):
Log info 0x31130000 received for target 11 w50025385503bdc88.
scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
/pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2):
Log info 0x31140000 received for target 11 w50025385503bdc88.
scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
/pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2):
Log info 0x31140000 received for target 11 w50025385503bdc88.
scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
/pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2):
Log info 0x31140000 received for target 11 w50025385503bdc88.
scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
/pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2):
Log info 0x31140000 received for target 11 w50025385503bdc88.
scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
/pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2):
Log info 0x31140000 received for target 11 w50025385503bdc88.
scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
/pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2):
Log info 0x31140000 received for target 11 w50025385503bdc88.
scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
/pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2):
Log info 0x31140000 received for target 11 w50025385503bdc88.
scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
/pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2):
Log info 0x31140000 received for target 11 w50025385503bdc88.
scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
/pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2):
Log info 0x31140000 received for target 11 w50025385503bdc88.
scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
/pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2):
Log info 0x31140000 received for target 11 w50025385503bdc88.
scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
/pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2):
Log info 0x31140000 received for target 11 w50025385503bdc88.
scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
/pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2):
Log info 0x31140000 received for target 11 w50025385503bdc88.
scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
/pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2):
Log info 0x31140000 received for target 11 w50025385503bdc88.
scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
/pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2):
Log info 0x31140000 received for target 11 w50025385503bdc88.
scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
/pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2):
Log info 0x31140000 received for target 11 w50025385503bdc88.
scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
/pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2):
Log info 0x31140000 received for target 11 w50025385503bdc88.
scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
/pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2):
Log info 0x31140000 received for target 11 w50025385503bdc88.
scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
/pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2):
Log info 0x31140000 received for target 11 w50025385503bdc88.
scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
/pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2):
Log info 0x31140000 received for target 11 w50025385503bdc88.
scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
WARNING: /scsi_vhci (scsi_vhci0):
/scsi_vhci/disk@g50025385503bdc88 (sd7): Command Timeout on path mpt_sas9/disk@w50025385503bdc88,0
/pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2):
Timeout of 5 seconds expired with 1 commands on target 11 lun 0.
WARNING: /pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2):
Disconnected command timeout for target 11 w50025385503bdc88.
/pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2):
Log info 0x31130000 received for target 11 w50025385503bdc88.
scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
WARNING: /pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2):
mptsas_handle_event_sync: IOCStatus=0x8000, IOCLogInfo=0x31111000
WARNING: /pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2):
mptsas_handle_event: IOCStatus=0x8000, IOCLogInfo=0x31111000
/pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2):
Timeout of 5 seconds expired with 4 commands on target 11 lun 0.
WARNING: /pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2):
Disconnected command timeout for target 11 w50025385503bdc88.
/pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2):
Log info 0x31111000 received for target 11 w50025385503bdc88.
scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc
/pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2):
Log info 0x31111000 received for target 11 w50025385503bdc88.
scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc
/pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2):
Log info 0x31111000 received for target 11 w50025385503bdc88.
scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc
/pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2):
Log info 0x31111000 received for target 11 w50025385503bdc88.
scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc
WARNING: /pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2):
mptsas_check_task_mgt: Task 0x3 failed. IOCStatus=0x4b IOCLogInfo=0x0 target=11

WARNING: /pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2):
mptsas_ioc_task_management failed try to reset ioc to recovery!
/pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2):
mpt2 Firmware version v18.0.0.0 (?)
/pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2):
mpt2 MPI Version 0x200
/pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2):
mpt2: IOC Operational.
WARNING: /pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2):
mptsas_handle_event_sync: IOCStatus=0x8000, IOCLogInfo=0x31111000
/pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2):
Log info 0x31111000 received for target 11 p0.
scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc
WARNING: /pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2):
mptsas request inquiry page 0x83 for target:b, lun:0 failed!
WARNING: /pci@0,0/pci8086,1d18@1c,4/pci1000,3020@0 (mpt_sas2):
Target 11 reset for command timeout recovery failed!
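
The log above is easier to read when the repeated "Log info" entries are tallied. Counting by hand, the first timeout reports 21 commands and is followed by 21 entries (2 with 0x31130000, 19 with 0x31140000); the later timeouts with 1 and 4 commands are followed by 1 and 4 entries. A minimal sketch of such a tally follows. The excerpt string is a shortened copy of the log, and the regular expression is only an assumption about the message layout as it appears in this report.

```python
import re
from collections import Counter

# A short excerpt copied from the message log; in practice the whole
# log would be read from a file instead.
excerpt = """\
Log info 0x31130000 received for target 11 w50025385503bdc88.
scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
Log info 0x31140000 received for target 11 w50025385503bdc88.
scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
Log info 0x31140000 received for target 11 w50025385503bdc88.
scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
Log info 0x31111000 received for target 11 w50025385503bdc88.
scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc
"""

# Each entry spans two lines: the "Log info" line and the status line.
pattern = re.compile(
    r"Log info (0x[0-9a-f]+) received for target (\d+) .*\n"
    r"scsi_status=(0x[0-9a-f]+), ioc_status=(0x[0-9a-f]+)"
)

# Key: (target, log info code, ioc_status); value: number of occurrences.
counts = Counter(
    (m.group(2), m.group(1), m.group(4)) for m in pattern.finditer(excerpt)
)

for (target, log_info, ioc_status), n in sorted(counts.items()):
    print(f"target {target}: log_info={log_info} ioc_status={ioc_status} x{n}")
```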

I assume it happens as follows:
The mpt_sas driver sets up a watch routine that is driven by a timer in one thread (thread A).
In the meantime, commands to a disk time out several times (for unknown reasons) and ZFS detaches the disk from the system in order to add the spare (thread B).
When the timer in thread A expires, it tries to read from the bus (from the disk that was detached) and traps.
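
The suspected interleaving can be modelled in a few lines. This is only a sketch of the hypothesis, not the mpt_sas code: Target, Controller and their methods are hypothetical names, and the interleaving is played out sequentially so the result is deterministic. In Python the stale object stays readable; in C the equivalent access could touch freed memory.

```python
import threading


class Target:
    """Hypothetical stand-in for per-target driver state."""

    def __init__(self, target_id):
        self.target_id = target_id
        self.attached = True


class Controller:
    """Hypothetical stand-in for one controller instance."""

    def __init__(self):
        self.lock = threading.Lock()
        self.targets = {}

    def attach(self, target_id):
        with self.lock:
            self.targets[target_id] = Target(target_id)

    def detach(self, target_id):
        # "Thread B": remove the target while holding the lock.
        with self.lock:
            target = self.targets.pop(target_id)
            target.attached = False

    def watch(self):
        # "Thread A": timer callback. It looks targets up under the same
        # lock every time it fires instead of keeping references from an
        # earlier pass, so a detached target is never visited.
        with self.lock:
            return [t.target_id for t in self.targets.values() if t.attached]


ctl = Controller()
ctl.attach(11)
ctl.attach(12)

stale = ctl.targets[11]   # unsafe pattern: reference captured before detach
ctl.detach(11)            # thread B runs in between

print(stale.attached)     # the stale reference now points at a dead target
print(ctl.watch())        # the locked lookup only sees target 12
```

The point of the sketch is the difference between the two reads at the end: a reference held across the detach is stale, while a lookup performed under the lock shared with detach is not.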

Possibly related to issue #5537?


Related issues

Related to illumos gate - Bug #5208: Edge case in mpt_sas causing panic (New, 2014-10-03)

History

#1

Updated by Rich Ercolani almost 5 years ago

cf. #5208 for other people reporting this issue.

#2

Updated by Marcel Telka almost 3 years ago

  • Related to Bug #5208: Edge case in mpt_sas causing panic added

#3

Updated by Marcel Telka almost 3 years ago

  • Category set to driver - device drivers
