Bug #5492

Panic possibly in iscsi initiator

Added by Keith Hall about 6 years ago. Updated about 6 years ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Start date: 2014-12-28
Due date:
% Done: 0%
Estimated time:
Difficulty: Medium
Tags: needs-triage
Gerrit CR:

Description

Hi,

I'm relatively new to this so please bear with me.

Attached is a crash dump from my system, which panicked earlier this morning after around 4.5 days of uptime...

system hardware:
Supermicro X10SL7F [SAS firmware 19 IT mode]
8 x Seagate ST1200MM0007 1.2TB 10k SAS - 7 in raidz2 pool, 1 spare
32GB ECC RAM
IBM M1015 HBA SAS firmware 9211-8 P19 IT mode attached:
2 x Crucial M550 128GB SSD in pool as cache
1 x Seagate ST3000DM001 3TB SATA as backup pool

Running ESXi 5.5 with VT-d passthrough of the SAS controllers to a VM running oi_151a8 (2 CPUs, 6GB RAM).

I was running the ZIL on a SLOG made from the 2 Crucial SSDs, but sustained write performance was rubbish.

I didn't want to run sync=disabled, but I wanted good write performance. I couldn't afford any more dedicated SLOG hardware (a DDRdrive X1 would be sooo nice!), so I thought about a ramdisk, but didn't want it on the same OS...

This is an all-in-one box running on a UPS. If the whole box died, some data loss probably wouldn't matter to any running VMs, but if the OI server died or crashed while leaving the VMs running, that wouldn't be good. So, to offer SOME protection against data loss, I thought I'd create another VM purely as a RAM-based SLOG device over iSCSI.

So we have a 'zil' VM with 4GB RAM running Linux tgtd, exporting a 3GB SLOG device over a separate VLAN with MTU 9000.

The performance of this is pretty great, around 300-400MB/s for sync writes.

It had been working fine up until about 8:30 this morning, when there was a panic. The only difference is that there has been a fairly steady write load on the system over the past 18 hours or so. If there was some issue writing to the iSCSI device, then perhaps that could have caused the crash.

I don't really know for sure though, so that's why I'm uploading the crash dump here...

Hope this is of some use and that someone is able to help me drill down into this issue further.

ZILSTAT leading up to the crash:

TIME                  N-MB  N-MB/s  N-Max-Rate  B-MB  B-MB/s  B-Max-Rate  ops  <=4kB  4-32kB  >=32kB
2014 Dec 28 08:23:41     7       0           1    37       1           8  443      0       6     437
2014 Dec 28 08:24:11     6       0           1    31       1           8  415      0       0     415
2014 Dec 28 08:24:41     6       0           1    32       1           8  421      0       6     415
2014 Dec 28 08:25:11     7       0           1    30       1           5  442      0       0     442
2014 Dec 28 08:25:41     7       0           1    35       1           9  469      0       8     461
2014 Dec 28 08:26:11     6       0           1    35       1           8  420      0       2     418
2014 Dec 28 08:26:41     7       0           1    33       1           8  456      0       5     451
2014 Dec 28 08:27:11     7       0           2    33       1           9  431      0       1     430
2014 Dec 28 08:27:41    10       0           5    42       1          12  499      0       5     494
2014 Dec 28 08:28:11     7       0           1    37       1           9  473      0       1     472
2014 Dec 28 08:28:41     6       0           1    30       1           6  401      0       0     401
2014 Dec 28 08:29:11     7       0           1    33       1           8  436      0       6     430
2014 Dec 28 08:29:41     7       0           1    35       1           6  439      0       0     439
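
For anyone triaging, the zilstat samples above can be summarized with a few lines of Python. This is only a sketch: the rows are copied from the output above, the 30-second interval is inferred from the timestamps, and the column names (`b_mb`, `ops`, and so on) are hypothetical labels for the zilstat headers, not anything zilstat itself defines.

```python
# Samples copied from the zilstat output in this report (date dropped,
# one row per 30-second interval).
ZILSTAT_ROWS = """\
08:23:41 7 0 1 37 1 8 443 0 6 437
08:24:11 6 0 1 31 1 8 415 0 0 415
08:24:41 6 0 1 32 1 8 421 0 6 415
08:25:11 7 0 1 30 1 5 442 0 0 442
08:25:41 7 0 1 35 1 9 469 0 8 461
08:26:11 6 0 1 35 1 8 420 0 2 418
08:26:41 7 0 1 33 1 8 456 0 5 451
08:27:11 7 0 2 33 1 9 431 0 1 430
08:27:41 10 0 5 42 1 12 499 0 5 494
08:28:11 7 0 1 37 1 9 473 0 1 472
08:28:41 6 0 1 30 1 6 401 0 0 401
08:29:11 7 0 1 33 1 8 436 0 6 430
08:29:41 7 0 1 35 1 6 439 0 0 439
"""

# Hypothetical labels for the zilstat columns, in header order.
COLUMNS = ["n_mb", "n_mb_s", "n_max_rate", "b_mb", "b_mb_s",
           "b_max_rate", "ops", "le_4k", "k4_to_32k", "ge_32k"]


def parse_row(line):
    """Turn one 'HH:MM:SS v1 v2 ...' row into a dict keyed by column label."""
    time, *values = line.split()
    row = dict(zip(COLUMNS, (int(v) for v in values)))
    row["time"] = time
    return row


samples = [parse_row(line) for line in ZILSTAT_ROWS.splitlines()]

interval_s = 30                       # inferred from the timestamps
window_s = interval_s * len(samples)  # total time covered by the samples

total_b_mb = sum(s["b_mb"] for s in samples)
total_ops = sum(s["ops"] for s in samples)
busiest = max(samples, key=lambda s: s["b_max_rate"])

print(f"{len(samples)} samples covering {window_s} s")
print(f"B-MB total {total_b_mb}, mean {total_b_mb / window_s:.2f} MB/s")
print(f"ops total {total_ops}, mean {total_ops / window_s:.1f} ops/s")
print(f"highest B-Max-Rate {busiest['b_max_rate']} at {busiest['time']}")
```

On these numbers the log device was averaging roughly 1 MB/s in the minutes before the panic, with almost all operations in the >=32kB bucket. That is far below the 300-400MB/s the device manages, which suggests (but does not prove) that raw throughput alone was not the trigger.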

I'm sure I've left something out, so if any further info is required please shout :)

Kind regards,

Keith.


Files

crash.0.gz (53.6 KB) Keith Hall, 2014-12-28 11:17 AM
#1

Updated by Keith Hall about 6 years ago

Whether it's causal or coincidental, this also appears in the log prior to the crash:

Dec 28 06:07:35 box scsi: [ID 107833 kern.warning] WARNING: %3Aramzil0001,1 (sd12):
Dec 28 06:07:35 box incomplete write- retrying
Dec 28 08:11:20 box scsi: [ID 107833 kern.warning] WARNING: %3Aramzil0001,1 (sd12):
Dec 28 08:11:20 box incomplete write- retrying

Subsequently:

Dec 28 08:30:01 box unix: [ID 836849 kern.notice]
Dec 28 08:30:01 box ^Mpanic[cpu0]/thread=ffffff025be08760:
Dec 28 08:30:01 box genunix: [ID 683410 kern.notice] BAD TRAP: type=e (#pf Page fault) rp=ffffff000e8a7590 addr=ffffff02cbffa380
Dec 28 08:30:01 box unix: [ID 100000 kern.notice]
Dec 28 08:30:01 box unix: [ID 839527 kern.notice] zpool:
Dec 28 08:30:01 box unix: [ID 753105 kern.notice] #pf Page fault
Dec 28 08:30:01 box unix: [ID 532287 kern.notice] Bad kernel fault at addr=0xffffff02cbffa380
Dec 28 08:30:01 box unix: [ID 243837 kern.notice] pid=29757, pc=0xfffffffffb85ade8, sp=0xffffff000e8a7688, eflags=0x10246
Dec 28 08:30:01 box unix: [ID 211416 kern.notice] cr0: 8005003b<pg,wp,ne,et,ts,mp,pe> cr4: 406b8<osxsav,xmme,fxsr,pge,pae,pse,de>
Dec 28 08:30:01 box unix: [ID 624947 kern.notice] cr2: ffffff02cbffa380
Dec 28 08:30:01 box unix: [ID 625075 kern.notice] cr3: 1be437000
Dec 28 08:30:01 box unix: [ID 625715 kern.notice] cr8: c
Dec 28 08:30:01 box unix: [ID 100000 kern.notice]
Dec 28 08:30:01 box unix: [ID 592667 kern.notice] rdi: ffffff02cbffa380 rsi: 40 rdx: ffffff025be08760
Dec 28 08:30:01 box unix: [ID 592667 kern.notice] rcx: 40 r8: 9 r9: ffffff02fc133e80
Dec 28 08:30:01 box unix: [ID 592667 kern.notice] rax: 0 rbx: 2 rbp: ffffff000e8a76c0
Dec 28 08:30:01 box unix: [ID 592667 kern.notice] r10: fffffffffb85ac08 r11: 0 r12: 80
Dec 28 08:30:01 box unix: [ID 592667 kern.notice] r13: ffffff02cbffa380 r14: ffffff024ec282c8 r15: fffffffff7b31870
Dec 28 08:30:01 box unix: [ID 592667 kern.notice] fsb: 0 gsb: fffffffffbc30640 ds: 4b
Dec 28 08:30:01 box unix: [ID 592667 kern.notice] es: 4b fs: 0 gs: 1c3
Dec 28 08:30:01 box unix: [ID 592667 kern.notice] trp: e err: 2 rip: fffffffffb85ade8
Dec 28 08:30:01 box unix: [ID 592667 kern.notice] cs: 30 rfl: 10246 rsp: ffffff000e8a7688
Dec 28 08:30:01 box unix: [ID 266532 kern.notice] ss: 38
Dec 28 08:30:01 box unix: [ID 100000 kern.notice]
Dec 28 08:30:01 box genunix: [ID 655072 kern.notice] ffffff000e8a7460 unix:die+100 ()
Dec 28 08:30:01 box genunix: [ID 655072 kern.notice] ffffff000e8a7580 unix:trap+17db ()
Dec 28 08:30:01 box genunix: [ID 655072 kern.notice] ffffff000e8a7590 unix:cmntrap+e6 ()
Dec 28 08:30:01 box genunix: [ID 655072 kern.notice] ffffff000e8a76c0 unix:bzero+358 ()
Dec 28 08:30:01 box genunix: [ID 655072 kern.notice] ffffff000e8a7700 sd:sd_uscsi_strategy+119 ()
Dec 28 08:30:01 box genunix: [ID 655072 kern.notice] ffffff000e8a77c0 scsi:scsi_uscsi_handle_cmd+188 ()
Dec 28 08:30:01 box genunix: [ID 655072 kern.notice] ffffff000e8a7830 sd:sd_ssc_send+1fd ()
Dec 28 08:30:01 box genunix: [ID 655072 kern.notice] ffffff000e8a78e0 sd:sd_send_scsi_TEST_UNIT_READY+10b ()
Dec 28 08:30:01 box genunix: [ID 655072 kern.notice] ffffff000e8a7960 sd:sd_ready_and_valid+211 ()
Dec 28 08:30:01 box genunix: [ID 655072 kern.notice] ffffff000e8a79f0 sd:sdopen+2d5 ()
Dec 28 08:30:01 box genunix: [ID 655072 kern.notice] ffffff000e8a7a20 genunix:dev_open+3c ()
Dec 28 08:30:01 box genunix: [ID 655072 kern.notice] ffffff000e8a7ad0 specfs:spec_open+5dc ()
Dec 28 08:30:01 box genunix: [ID 655072 kern.notice] ffffff000e8a7b40 genunix:fop_open+bf ()
Dec 28 08:30:01 box genunix: [ID 655072 kern.notice] ffffff000e8a7cf0 genunix:vn_openat+6ce ()
Dec 28 08:30:01 box genunix: [ID 655072 kern.notice] ffffff000e8a7e60 genunix:copen+49e ()
Dec 28 08:30:01 box genunix: [ID 655072 kern.notice] ffffff000e8a7e90 genunix:openat32+27 ()
Dec 28 08:30:01 box genunix: [ID 655072 kern.notice] ffffff000e8a7ec0 genunix:open32+2e ()
Dec 28 08:30:01 box genunix: [ID 655072 kern.notice] ffffff000e8a7f10 unix:brand_sys_sysenter+1c9 ()
Dec 28 08:30:01 box unix: [ID 100000 kern.notice]
Dec 28 09:53:49 box unix: [ID 836849 kern.notice]
Dec 28 09:53:49 box ^Mpanic[cpu0]/thread=ffffff025be08760:
Dec 28 09:53:49 box genunix: [ID 335743 kern.notice] BAD TRAP: type=e (#pf Page fault) rp=fffffffffbc79a40 addr=0 occurred in module "<unknown>" due to a NULL pointer dereference
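
As a reading aid for the first panic, the `err:` value in the register dump can be decoded using the standard x86 page-fault error-code bits. The sketch below is illustrative only: the function names are hypothetical, the input strings are copied from the log above, and it is an interpretation of the trap frame, not a diagnosis of the root cause.

```python
import re

# Architectural x86 page-fault error-code bits:
# (mask, meaning if the bit is set, meaning if it is clear or None).
PF_ERROR_BITS = [
    (0x01, "protection violation", "page not present"),
    (0x02, "write", "read"),
    (0x04, "user mode", "kernel mode"),
    (0x08, "reserved bit set in paging structure", None),
    (0x10, "instruction fetch", None),
]


def decode_pf_error(err):
    """Return a list of human-readable meanings for a #PF error code."""
    meanings = []
    for mask, if_set, if_clear in PF_ERROR_BITS:
        text = if_set if err & mask else if_clear
        if text is not None:
            meanings.append(text)
    return meanings


def parse_err(line):
    """Pull the hex 'err:' value out of a register-dump line, or None."""
    match = re.search(r"\berr:\s*([0-9a-f]+)", line)
    return int(match.group(1), 16) if match else None


# Values copied from the first panic in the log above.
TRAP_LINE = "trp: e err: 2 rip: fffffffffb85ade8"
CR2 = 0xffffff02cbffa380   # faulting address
RDI = 0xffffff02cbffa380   # first argument register at the time of the trap

err = parse_err(TRAP_LINE)
print(f"err={err:#x}: {', '.join(decode_pf_error(err))}")
print(f"fault address equals rdi: {CR2 == RDI}")
```

Read this way, the trap was a kernel-mode write to a page that was not mapped. The faulting address matches rdi, which on amd64 normally carries the first argument, so it looks as though bzero was handed an unmapped destination buffer by sd:sd_uscsi_strategy during the TEST UNIT READY issued from sdopen. The crash dump would be needed to confirm that.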
