Bug #1872
Closed
NULL pointer dereference in iscsit
100%
Description
Running oi_151a on a Sun X4100M2 with COMSTAR and ZFS active. Kernel faulted with:
BAD TRAP: type=e (#pf Page fault) rp=ffffff00129c8a10 addr=8d8 occurred in module "unix" due to a NULL pointer dereference
Can't really pinpoint the exact cause of the crash or what was running while the crash was occurring. This system is heavily used for NFS, COMSTAR, and ZFS.
Crash file available at http://support.thewrittenword.com/vmdump.0.
Kernel messages and stack information available at http://support.thewrittenword.com/crash.0. This consists of the output from running:
echo '::panicinfo\n::cpuinfo -v\n::threadlist -v 10\n::msgbuf\n*panic_thread::findstack -v\n::stacks' | mdb 0 > /tmp/crash.0
# psrinfo -v
Status of virtual processor 0 as of: 12/09/2011 19:10:00
  on-line since 12/09/2011 03:17:00.
  The i386 processor operates at 1800 MHz,
  and has an i387 compatible floating point processor.
Status of virtual processor 1 as of: 12/09/2011 19:10:00
  on-line since 12/09/2011 03:17:02.
  The i386 processor operates at 1800 MHz,
  and has an i387 compatible floating point processor.
# prtconf | grep Memory
Memory size: 8192 Megabytes
# mdb 0
Loading modules: [ unix genunix specfs dtrace mac cpu.generic cpu_ms.AuthenticAMD.15 uppc pcplusmp scsi_vhci zfs mpt sd ip hook neti sockfs arp usba qlc fctl stmf stmf_sbd md lofs fcp random idm crypto cpc fcip ufs logindmux ptm sppp smbsrv nfs nsmb ]
> *panic_thread::findstack -v
stack pointer for thread ffffff00129c8c40: ffffff00129c6e70
  ffffff00129c8770 0xffffff03076ec250()
  ffffff00129c8860 panic+0x94()
  ffffff00129c88f0 die+0xdd(e, ffffff00129c8a10, 8d8, 0)
  ffffff00129c8a00 trap+0x1799(ffffff00129c8a10, 8d8, 0)
  ffffff00129c8a10 0xfffffffffb8001d6()
  ffffff00129c8b40 mutex_enter+0xb()
  ffffff00129c8b70 iscsit_deferred+0x36(ffffff0313beb090)
  ffffff00129c8c20 taskq_thread+0x285(ffffff030756bd50)
  ffffff00129c8c30 thread_start+8()
> ::panicinfo
  cpu 0
  thread ffffff00129c8c40
  message BAD TRAP: type=e (#pf Page fault) rp=ffffff00129c8a10 addr=8d8 occurred in module "unix" due to a NULL pointer dereference
  rdi 8d8
  rsi ffffff0313beb1f8
  rdx ffffff00129c8c40
  rcx 9b80fd00
  r8 ffffff036f674000
  r9 ffffff032e8c51e0
  rax 0
  rbx ffffff0306d29000
  rbp ffffff00129c8b40
  r10 56ceaa8e
  r10 56ceaa8e
  r11 8
  r12 ffffff0313beb230
  r13 ffffff0313beb090
  r14 8d8
  r15 0
  fsbase 0
  gsbase fffffffffbc304a0
  ds 4b
  es 4b
  fs 0
  gs 1c3
  trapno e
  err 2
  rip fffffffffb86ccfb
  cs 30
  rflags 10246
  rsp ffffff00129c8b08
  ss 0
  gdt_hi 0
  gdt_lo e00001ef
  idt_hi 0
  idt_lo d0000fff
  ldt 0
  task 70
  cr0 8005003b
  cr2 8d8
  cr3 8c00000
  cr4 6f8
Updated by Rich Lowe over 11 years ago
Looking very briefly, it looks like we're really here:
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/io/comstar/port/iscsit/iscsit.c#3100
with ist NULL
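A fault address of 8d8 rather than 0 is what a NULL structure pointer plus a member offset would produce, and rdi (the first argument, here to mutex_enter) is also 8d8. A minimal sketch of that arithmetic follows. The structure, field names, and padding are hypothetical and chosen only so that the lock lands at offset 0x8d8; the real iscsit session layout is not shown in this report.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for a session structure. This is NOT the real
 * iscsit layout; the padding is picked so the lock sits at offset 0x8d8.
 */
typedef struct off_sess {
	char		os_other_fields[0x8d8];	/* everything before the lock */
	uint64_t	os_mutex;		/* stand-in for a kernel mutex */
} off_sess_t;

/*
 * Address that would be handed to a lock routine for a session whose
 * base address is sess_base. With a NULL base, the result is just the
 * member offset, which is what shows up as the faulting address.
 */
uintptr_t
off_lock_addr(uintptr_t sess_base)
{
	return (sess_base + offsetof(off_sess_t, os_mutex));
}
```

Under that assumed layout, off_lock_addr(0) evaluates to 0x8d8, matching addr, rdi, r14, and cr2 in the panic output above.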
Updated by Rich Lowe over 11 years ago
All of this is conjecture, I understand none of iscsi, the target, nor comstar.
I think we're processing deferred commands for a session, the threadlist makes it look as if we've also lost contact with a lot of clients (or sessions? or something?) and are tearing down their state. The locking for this looks mostly right (it's happening under the lock we take).
That, however, makes me wonder if we've entirely ripped a session away from under the deferred processing taskq, and that that's why our session is NULL. (kmem logs sure would help, but oh well).
It seems possible that this can happen, because idm_conn_unref is called via taskq from CS_S11_COMPLETE of idm_update_state, immediately after notifying of the CONNECT_LOST.
We do try to abort any tasks related to this client: http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/io/idm/idm_conn_sm.c#1291
Perhaps we don't succeed? Perhaps we lock insufficiently?
Updated by Dan McDonald over 10 years ago
We're seeing this now from a Nexenta customer. Same situation (lots of closing iSCSI threads). From my coredump (can't share due to customer data), the iSCSI target connection login state looks like one that was freshly allocated:
ict_login_sm = {
icl_login_state = 1 (ILS_LOGIN_INIT)
icl_login_last_state = 1 (ILS_LOGIN_INIT)
but hasn't had a session:
ict_sess = 0
attached yet. No idea whether this can be easily reproduced (and nobody has produced one with kmem_flags = 0xf either, I believe).
Updated by Dan McDonald over 9 years ago
- Category set to comstar - iSCSI/FC/SAS target
- Assignee set to Dan McDonald
- % Done changed from 0 to 60
We have reproduced a variant of this in-house at Nexenta. (BTW, kmem_flags = 0xf didn't help.)
Basically, an iSCSI PDU that arrives before login can access state that isn't initialized until login occurs.
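To illustrate the failure mode, here is a minimal user-space sketch of a deferred handler that checks for a not-yet-attached session before touching session state. All names here (guard_conn_t, guard_sess_t, guard_deferred) are hypothetical, and this is only one possible way to guard such a path; it is not taken from the actual fix.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical session state that only exists after login completes. */
typedef struct guard_sess {
	int	gs_deferred_runs;	/* counts deferred passes, for the demo */
} guard_sess_t;

/* Hypothetical connection; gc_sess stays NULL until a session is attached. */
typedef struct guard_conn {
	guard_sess_t	*gc_sess;
} guard_conn_t;

enum { GUARD_OK = 0, GUARD_NO_SESSION = -1 };

/*
 * Deferred work for a connection. Without the NULL check, a PDU handled
 * before login would dereference gc_sess (for example, to take a lock
 * embedded in the session) and fault at the member's offset.
 */
int
guard_deferred(guard_conn_t *conn)
{
	guard_sess_t *sess = conn->gc_sess;

	if (sess == NULL)
		return (GUARD_NO_SESSION);	/* pre-login: nothing to process */

	/* Real code would take the session lock here before doing work. */
	sess->gs_deferred_runs++;
	return (GUARD_OK);
}
```

A freshly allocated connection (gc_sess == NULL) returns GUARD_NO_SESSION instead of faulting. Once a session is attached, the same call does its work and returns GUARD_OK.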
Updated by Dan McDonald over 9 years ago
- Subject changed from NULL pointer dereference in kernel to NULL pointer dereference in iscsit
Updated by Robert Mustacchi over 9 years ago
- Status changed from New to Resolved
- % Done changed from 60 to 100
- Tags deleted (needs-triage)
Resolved in 241996a3debb0b7690aef018488c76770d83311e.