Bug #3435
closednfssrv causes bad mutex crash
100%
Description
While using nfs, with what seems to be high open/close file activity causes oi151a7 to crash after a few hours.
This has happened when doing a virus scan over nfs, and doing bulk file processing over nfs. Disabling these jobs, causes it to run days without an issue, but when enabled, the system won't stay up for >24hours without a panic.
Before an upgrade everything was functioning just fine.
http://patrickdk.com/crash.jpg
http://patrickdk.com/crash-nfs2.avi
Also I have attempted many times so far, but the system just won't save a dump report.
I have set:
set dump_plat_mincpu=0
set dump_bzip2_level=1
and dumpadm:
Dump content: kernel pages
Dump device: /dev/zvol/dsk/rpool/dump (dedicated)
Savecore directory: /var/crash/stor1
Savecore enabled: yes
Save compressed: off
It did try save compressed: on, also, but attempting without compression now.
savecore always says: savecore: bad magic number 337c7900
The system has 16gigs of ram and dump is
rpool/dump 30.0G 54.3G 30.0G -
Related issues
Updated by Patrick Domack over 10 years ago
Ok, I finally managed to get a crash dump saved.
Updated by David Höppner over 10 years ago
Project should be "illumos-gate" not "site".
Updated by Rich Lowe over 10 years ago
- Project changed from site to illumos gate
Updated by Marcel Telka over 10 years ago
The panic stack is:
> ::stack vpanic() mutex_panic+0x73(fffffffffb961bd0, ffffff05abfa2908) mutex_vector_enter+0x446(ffffff05abfa2908) exi_hold+0x26(ffffff05abfa2800) checkexport+0xac(ffffff00200e17d4, ffffff00200e17f8) common_dispatch+0x20a(ffffff00200e1b00, ffffff07610a0a00, 2, 4, fffffffff8ff9d68, ffffffffc03913d0) rfs_dispatch+0x2d(ffffff00200e1b00, ffffff07610a0a00) svc_getreq+0x19c(ffffff07610a0a00, ffffff07b9226c20) svc_run+0x171(ffffff05ae996aa0) svc_do_run+0x81(1) nfssys+0x765(e, febf0fbc) _sys_sysenter_post_swapgs+0x149() >
Updated by Marcel Telka over 10 years ago
The problem is that we do have the already freed exportinfo in the exptable:
> ::walk nfs_expinfo|::whatis ffffff05748a7800 is allocated from kmem_alloc_640 ffffff05748a7d00 is allocated from kmem_alloc_640 ffffff05ac1ea280 is allocated from kmem_alloc_640 ffffff05ac1eaa00 is allocated from kmem_alloc_640 ffffff057766c280 is allocated from kmem_alloc_640 ffffff05ac031d80 is allocated from kmem_alloc_640 ffffff05abfb50c0 is allocated from kmem_alloc_640 ffffff05ae4b85c0 is allocated from kmem_alloc_640 ffffff05ae4b8d40 is allocated from kmem_alloc_640 ffffff0577b3b540 is allocated from kmem_alloc_640 ffffff0577b3ba40 is allocated from kmem_alloc_640 ffffff05748bf780 is allocated from kmem_alloc_640 ffffff05abfa2800 is freed from kmem_alloc_640 >
Updated by Rich Lowe over 10 years ago
Yeah, Dan had an analysis for that too, but I guess it didn't make the bug?
If you look at where it's freed, there's a pretty obvious race, if I recall correctly.
Updated by Marcel Telka over 10 years ago
Rich, are you talking about #3449? If so, then I believe this issue is different.
Updated by Marcel Telka over 10 years ago
Updated by Marcel Telka over 10 years ago
- Status changed from New to Closed
- Assignee set to Marcel Telka
- % Done changed from 0 to 100
Updated by Marcel Telka over 10 years ago
Root cause:
In case the rfs3_lookup() is called with exi != NULL and the exi is not EX_PUBLIC and the nfs3_fhtovp() failed, we will do the 'goto out' here:
401 if (dvp == NULL) { 402 error = ESTALE; 403 goto out; 404 }
and rele the exi hold we never had here:
551out: 552 /* 553 * The passed argument exportinfo is released by the 554 * caller, common_dispatch 555 */ 556 exi_rele(exi);
Updated by Patrick Domack over 10 years ago
I've been running the patch above for alittle over a week, and it just crashed again, the exact same way.
TIME UUID SUNW-MSG-ID
Feb 01 2013 15:56:30.766618000 30d3d132-8f2f-6e56-ba11-d499d8165bf0 SUNOS-8000-KL
TIME CLASS ENA
Feb 01 15:56:30.7053 ireport.os.sunos.panic.savecore_failure 0x0000000000000000
Feb 01 15:56:21.2858 ireport.os.sunos.panic.dump_pending_on_device 0x0000000000000000
nvlist version: 0
version = 0x0
class = list.suspect
uuid = 30d3d132-8f2f-6e56-ba11-d499d8165bf0
code = SUNOS-8000-KL
diag-time = 1359752190 741296
de = fmd:///module/software-diagnosis
fault-list-sz = 0x1
fault-list = (array of embedded nvlists)
(start fault-list0)
nvlist version: 0
version = 0x0
class = defect.sunos.kernel.panic
certainty = 0x64
asru = sw:///:path=/var/crash/stor1/.30d3d132-8f2f-6e56-ba11-d499d8165bf0
resource = sw:///:path=/var/crash/stor1/.30d3d132-8f2f-6e56-ba11-d499d8165bf0
savecore-succcess = 0
failure-reason = stream tag 883 not in range 1..1
os-instance-uuid = 30d3d132-8f2f-6e56-ba11-d499d8165bf0
panicstr = mutex_enter: bad mutex, lp=ffffff04ff9edb48 owner=ffffff04ffe66c60 thread=ffffff04ffe76b40
panicstack = unix:mutex_panic+73 () | unix:mutex_vector_enter+367 () | nfssrv:exi_hold+26 () | nfssrv:checkexport+ac () | nfssrv:common_dispatch+250 () | nfssrv:rfs_dispatch+2d () | rpcmod:svc_getreq+1c1 () | rpcmod:svc_run+146 () | rpcmod:svc_do_run+8e () | nfs:nfssys+f1 () | unix:brand_sys_sysenter+1c9 () |
crashtime = 1359752065
panic-time = February 1, 2013 03:54:25 PM EST EST
(end fault-list0)
fault-status = 0x1
severity = Major
__ttl = 0x1
__tod = 0x510c2bfe 0x2db1a990
Updated by Yuri Pankov over 10 years ago
- Status changed from Closed to Feedback
Reopened per submitter's request.
Updated by Marcel Telka over 10 years ago
Please provide the crash dump file. Thanks.
Updated by Patrick Domack over 10 years ago
TIME UUID SUNW-MSG-ID
Feb 01 2013 19:34:26.860243000 31798f2a-2114-cf2c-9c02-e4b2c11053dd SUNOS-8000-KL
TIME CLASS ENA
Feb 01 19:34:23.0040 ireport.os.sunos.panic.dump_available 0x0000000000000000
Feb 01 19:34:12.3230 ireport.os.sunos.panic.dump_pending_on_device 0x0000000000000000
nvlist version: 0
version = 0x0
class = list.suspect
uuid = 31798f2a-2114-cf2c-9c02-e4b2c11053dd
code = SUNOS-8000-KL
diag-time = 1359765266 669178
de = fmd:///module/software-diagnosis
fault-list-sz = 0x1
fault-list = (array of embedded nvlists)
(start fault-list0)
nvlist version: 0
version = 0x0
class = defect.sunos.kernel.panic
certainty = 0x64
asru = sw:///:path=/var/crash/stor1/.31798f2a-2114-cf2c-9c02-e4b2c11053dd
resource = sw:///:path=/var/crash/stor1/.31798f2a-2114-cf2c-9c02-e4b2c11053dd
savecore-succcess = 1
dump-dir = /var/crash/stor1
dump-files = unix.1 vmcore.1
os-instance-uuid = 31798f2a-2114-cf2c-9c02-e4b2c11053dd
panicstr = mutex_enter: bad mutex, lp=ffffff050d3e0148 owner=ffffff04fbe547c0 thread=ffffff050d2d2120
panicstack = unix:mutex_panic+73 () | unix:mutex_vector_enter+367 () | nfssrv:exi_hold+26 () | nfssrv:checkexport+ac () | nfssrv:common_dispatch+250 () | nfssrv:rfs_dispatch+2d () | rpcmod:svc_getreq+1c1 () | rpcmod:svc_run+146 () | rpcmod:svc_do_run+8e () | nfs:nfssys+f1 () | unix:brand_sys_sysenter+1c9 () |
crashtime = 1359765138
panic-time = February 1, 2013 07:32:18 PM EST EST
(end fault-list0)
fault-status = 0x1
severity = Major
__ttl = 0x1
__tod = 0x510c5f12 0x33464438
And this time with crash dump savecore.
Updated by Marcel Telka over 10 years ago
The http://patrickdk.com/crash2.tar.gz contains these files:
$ ls -l total 352361 -rw-r--r-- 1 marcel staff 2948089 2012-12-31 03:42 unix.0 -rw-r--r-- 1 marcel staff 357564416 2013-02-02 01:34 vmcore.1 $
So there is missing the unix.1 file. Please provide it. Thanks.
Updated by Patrick Domack over 10 years ago
patrickdk.com/unix.1.gz
Ok, here is a another crash.
patrickdk.com/crash3.tar.gz
> ::status debugging crash dump vmcore.2 (64-bit) from stor1 operating system: 5.11 oi_151a7b (i86pc) image uuid: 4aa47629-96cb-edbd-e782-d83b465c1762 panic message: mutex_enter: bad mutex, lp=ffffff0503534dc8 owner=ffffff05026bf840 thread=ffffff05026d5480 dump content: kernel pages only > ::stack vpanic() mutex_panic+0x73(fffffffffb933bf5, ffffff0503534dc8) mutex_vector_enter+0x367(ffffff0503534dc8) exi_hold+0x26(ffffff0503534cc0) checkexport+0xac(ffffff001f6f8aa4, ffffff001f6f8ac8) common_dispatch+0x250(ffffff001f6f8ca0, ffffff0505d49400, 2, 4, fffffffff8416a16, ffffffffc0158060) rfs_dispatch+0x2d(ffffff001f6f8ca0, ffffff0505d49400) svc_getreq+0x1c1(ffffff0505d49400, ffffff04f6dc2d60) svc_run+0x146(ffffff0505f21e78) svc_do_run+0x8e(1) nfssys+0xf1(e, fecc0fbc) _sys_sysenter_post_swapgs+0x149() > ::walk nfs_expinfo|::whatis ffffff0502a4a100 is allocated from kmem_alloc_640 ffffff050530a2c0 is allocated from kmem_alloc_640 ffffff050530a540 is allocated from kmem_alloc_640 ffffff0502a4a600 is allocated from kmem_alloc_640 ffffff0502a4ad80 is allocated from kmem_alloc_640 ffffff050358c280 is allocated from kmem_alloc_640 ffffff0505598100 is allocated from kmem_alloc_640 ffffff0505598d80 is allocated from kmem_alloc_640 ffffff05028fc280 is allocated from kmem_alloc_640 ffffff05028fc500 is allocated from kmem_alloc_640 ffffff05028fc780 is allocated from kmem_alloc_640 ffffff050285ca40 is allocated from kmem_alloc_640 ffffff0503534cc0 is freed from kmem_alloc_640
Updated by Marcel Telka over 10 years ago
- Status changed from Feedback to Closed
Analysis of patrickdk.com/crash3.tar.gz
> rfs3_lookup::dis rfs3_lookup: pushq %rbp rfs3_lookup+1: movq %rsp,%rbp rfs3_lookup+4: subq $0x210,%rsp rfs3_lookup+0xb: testq %rdx,%rdx rfs3_lookup+0xe: movq %r12,-0x50(%rbp) rfs3_lookup+0x12: movq %r14,-0x40(%rbp) rfs3_lookup+0x16: movq %rdx,%r12 rfs3_lookup+0x19: movq %r15,-0x38(%rbp) rfs3_lookup+0x1d: movq %rbx,-0x58(%rbp) rfs3_lookup+0x21: movq %rdi,%r14 rfs3_lookup+0x24: movq %r13,-0x48(%rbp) rfs3_lookup+0x28: movq %r8,-0x28(%rbp) rfs3_lookup+0x2c: movq %rsi,%r15 rfs3_lookup+0x2f: movq %rcx,-0x20(%rbp) rfs3_lookup+0x33: movq %rdx,-0x18(%rbp) rfs3_lookup+0x37: movq %rsi,-0x10(%rbp) rfs3_lookup+0x3b: movq %rdi,-0x8(%rbp) rfs3_lookup+0x3f: movq %rcx,0xfffffffffffffe60(%rbp) rfs3_lookup+0x46: movq %r8,0xfffffffffffffe68(%rbp) rfs3_lookup+0x4d: movl $0x0,-0x70(%rbp) rfs3_lookup+0x54: movl $0x0,-0x6c(%rbp) rfs3_lookup+0x5b: je +0xa <rfs3_lookup+0x67> rfs3_lookup+0x5d: testb $0x4,0x18(%rdx) rfs3_lookup+0x61: jne +0x2f9 <rfs3_lookup+0x360> rfs3_lookup+0x67: leaq 0x8(%r14),%rdi rfs3_lookup+0x6b: movq %r12,%rsi rfs3_lookup+0x6e: call +0xd0cd <nfs3_fhtovp>
Your kernel does not contain the fix at https://www.illumos.org/attachments/821/nfs-regress-fix.patch (attached to Bug #3449).
That fix adds a call to exi_hold() somewhere before this line:
rfs3_lookup+0x5b: je +0xa <rfs3_lookup+0x67>
but we do not see such call.
Updated by Patrick Domack over 10 years ago
- Status changed from Closed to Feedback
Ah, that is the case, I had used the posted patch, but the attached patch is different.
Updated by Marcel Telka over 10 years ago
- Status changed from Feedback to Closed
Patrick Domack wrote:
Ah, that is the case, I had used the posted patch, but the attached patch is different.
I'm not sure what exactly is the posted patch, but the simplest approach for you might be to just backout the #2986 fix. In this case you'll not need to deal with any not-finished-yet partial fixes.
The issue described in #2986 is not so severe (I'm not aware about any case when anybody hit it), so you can live safely without a fix for the exi refcounter leak.
Why the move from Closed to Feedback? Please do not do that in the future without any justification. This bug (#3435) is a duplicate of #3449.
Updated by Marcel Telka about 10 years ago
- Category set to nfs - NFS server and client