Bug #3449
Closed
System panics after zfs rollback
Description
With sharenfs enabled for a filesystem, I ran a ZFS rollback. The system promptly panicked. From the console:
panic[cpu26]/thread=ffffff4777493120: mutex_destroy: not owner, lp=ffffff4349dd03c8 owner=ffffff46e3fb0be0 thread=ffffff4777493120

ffffff01eb948870 unix:mutex_panic+73 ()
ffffff01eb9488a0 unix:mutex_destroy+60 ()
ffffff01eb9488d0 nfssrv:exportfree+f4 ()
ffffff01eb948900 nfssrv:exi_rele+60 ()
ffffff01eb948c20 nfssrv:common_dispatch+c5 ()
ffffff01eb948c40 nfssrv:rfs_dispatch+2d ()
ffffff01eb948d20 rpcmod:svc_getreq+1c1 ()
ffffff01eb948d90 rpcmod:svc_run+146 ()
ffffff01eb948dd0 rpcmod:svc_do_run+8e ()
ffffff01eb948ec0 nfs:nfssys+f1 ()
ffffff01eb948f10 unix:brand_sys_sysenter+1c9 ()

panic: entering debugger (continue to save dump)

Welcome to kmdb
kmdb: unable to determine terminal type: assuming `vt100'
Loaded modules: [ scsi_vhci stmf_sbd crypto mac cpc uppc neti sd ptm ufs unix zfs krtld s1394 apix uhci hook lofs genunix idm ip logindmux nsmb usba specfs pcplusmp nfs md random cpu.generic arp mpt_sas stmf sockfs smbsrv ]
[26]> :c
syncing file systems... done
dumping to /dev/zvol/dsk/rpool/dump, offset 65536, content: kernel
Since this initial panic, the system will panic shortly after all ZFS filesystems are mounted.
From fmdump:
TIME                           UUID                                 SUNW-MSG-ID
Jan 04 2013 14:41:35.633638000 7034218a-9142-4785-a4dd-c5eb0fe9748c SUNOS-8000-KL

TIME                 CLASS                                 ENA
Jan 04 14:41:35.5759 ireport.os.sunos.panic.dump_available 0x0000000000000000

nvlist version: 0
	version = 0x0
	class = list.suspect
	uuid = 7034218a-9142-4785-a4dd-c5eb0fe9748c
	code = SUNOS-8000-KL
	diag-time = 1357328495 587499
	de = fmd:///module/software-diagnosis
	fault-list-sz = 0x1
	fault-list = (array of embedded nvlists)
	(start fault-list[0])
	nvlist version: 0
		version = 0x0
		class = defect.sunos.kernel.panic
		certainty = 0x64
		asru = sw:///:path=/var/crash/unknown/.7034218a-9142-4785-a4dd-c5eb0fe9748c
		resource = sw:///:path=/var/crash/unknown/.7034218a-9142-4785-a4dd-c5eb0fe9748c
		savecore-succcess = 1
		dump-dir = /var/crash/unknown
		dump-files = vmdump.1
		os-instance-uuid = 7034218a-9142-4785-a4dd-c5eb0fe9748c
		panicstr = BAD TRAP: type=e (#pf Page fault) rp=ffffff01ec1295c0 addr=108 occurred in module "unix" due to a NULL pointer dereference
		panicstack = unix:die+df () | unix:trap+db3 () | unix:cmntrap+e6 () | unix:mutex_enter+b () | nfssrv:rfs3_lookup+508 () | nfssrv:common_dispatch+539 () | nfssrv:rfs_dispatch+2d () | rpcmod:svc_getreq+1c1 () | rpcmod:svc_run+146 () | rpcmod:svc_do_run+8e () | nfs:nfssys+f1 () | unix:brand_sys_sysenter+1c9 () |
		crashtime = 1357323719
		panic-time = Fri Jan  4 13:21:59 2013 EST
	(end fault-list[0])
	fault-status = 0x1
	severity = Major
	__ttl = 0x1
	__tod = 0x50e7306f 0x25c48c70
The first core file is from one of the first panics today; the second is from the most recent panic.
http://share.hpcc.msu.edu/users/gmason/vmdump.0
http://share.hpcc.msu.edu/users/gmason/vmdump.1
Updated by Greg Mason over 10 years ago
In case it's of any use, here's another core:
Updated by Dan McDonald over 10 years ago
Thank you!
vmdump.1 proved useful.
Look at rfs3_lookup() for a case where exi (the export information) is NULL. The function mostly handles that, but the failing path hits here:
401	if (dvp == NULL) {
402		error = ESTALE;
403		goto out;
404	}
If you look at out:
551 out:
552	/*
553	 * The passed argument exportinfo is released by the
554	 * caller, common_dispatch
555	 */
556	exi_rele(exi);
557
The comment indicates that the caller, common_dispatch, releases the exi. It does, BUT a bugfix added an additional hold in some paths.
The bugfix in question is for #2986, which introduced this release to balance an additional hold it takes in rfs3_lookup(). It overlooked the possibility of exi still being NULL when reaching out via a direct goto. The straightforward (and more readable!) fix is:
diff -r b151bd260b71 usr/src/uts/common/fs/nfs/nfs3_srv.c
--- a/usr/src/uts/common/fs/nfs/nfs3_srv.c	Thu Dec 13 11:29:00 2012 -0800
+++ b/usr/src/uts/common/fs/nfs/nfs3_srv.c	Fri Jan 04 16:53:23 2013 -0500
@@ -433,6 +433,7 @@ rfs3_lookup(LOOKUP3args *args, LOOKUP3re
 		goto out1;
 	}

+	/* Take an extra hold here in case of public file handle usage. */
 	exi_hold(exi);

 	/*
@@ -550,10 +551,14 @@ rfs3_lookup(LOOKUP3args *args, LOOKUP3re
 out:
 	/*
-	 * The passed argument exportinfo is released by the
-	 * caller, common_dispatch
-	 */
-	exi_rele(exi);
+	 * The passed argument exportinfo is released here too, because of
+	 * the additional hold taken earlier.
+	 *
+	 * There's another way of getting here, though, so we need a NULL
+	 * exi check.
+	 */
+	if (exi != NULL)
+		exi_rele(exi);

 	if (curthread->t_flag & T_WOULDBLOCK) {
 		curthread->t_flag &= ~T_WOULDBLOCK;
Updated by Vitaliy Gusev over 10 years ago
- Status changed from New to In Progress
- Assignee set to Vitaliy Gusev
I'll take it.
Updated by Vitaliy Gusev over 10 years ago
- File nfs-regress-fix.patch nfs-regress-fix.patch added
Greg, could you check this patch?
Updated by Greg Mason over 10 years ago
The system panicked again before I could get the patch in place, this time without any ESXi client involved; only RHEL 6.1 NFSv3 clients had exports mounted.
The patch is now in place on the system, and it appears to have fixed the issue: once the patched nfssrv module was loaded, the panic/reboot loop stopped.
Here's a crash dump from the last panic the system had (immediately before I got the patch in place): http://share.hpcc.msu.edu/users/gmason/vmdump.4
Updated by Marcel Telka over 10 years ago
- Category changed from zfs - Zettabyte File System to nfs - NFS server and client
Updated by Andrew Rokitka over 10 years ago
Is there any documentation on how to apply and build this patch?
Updated by Vitaliy Gusev over 10 years ago
The common way to apply it is "patch -p1 < {PATCH}".
Building is described in "How To Build illumos".
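For anyone unfamiliar with patch(1), here is a tiny self-contained demonstration of applying a unified diff with -p1. The tree and file names are purely illustrative, not the real illumos source tree or the attached patch.

```shell
# Build a throwaway tree and a matching unified diff
# (paths here are made up for the demo).
mkdir -p demo/a/src
printf 'old line\n' > demo/a/src/file.c

cat > demo/fix.patch <<'EOF'
--- a/src/file.c
+++ b/src/file.c
@@ -1 +1 @@
-old line
+new line
EOF

# -p1 strips the leading "a/" (or "b/") path component, so the
# patch must be applied from the top of the tree it describes.
cd demo/a && patch -p1 < ../fix.patch
cat src/file.c
```

After this, src/file.c contains "new line". For a kernel module like nfssrv, the rebuilt module then has to be installed and loaded in place of the stock one.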
Updated by Patrick Domack over 10 years ago
My system also seems to have stabilized after adding this patch.
Well, it stabilized for a little over a week, and now my issue is back. I'm not sure my issue was completely related to this bug, though the patch was supposed to fix my issue as well.
Updated by Rasmus Fauske over 10 years ago
Hi,
Is this patch included in a binary release, or is there a plan for when it will be introduced?
Updated by Marcel Telka about 10 years ago
- Assignee changed from Vitaliy Gusev to Marcel Telka
The fix for #2986 that introduced this issue was backed out from the illumos gate:
Branch: refs/heads/master
Home: https://github.com/illumos/illumos-gate
Commit: 596bc2391087577f88d3665a6fb36aba8f1f2003
    https://github.com/illumos/illumos-gate/commit/596bc2391087577f88d3665a6fb36aba8f1f2003
Author: Marcel Telka <marcel.telka@nexenta.com>
Date: 2013-03-14 (Thu, 14 Mar 2013)

Changed paths:
    M usr/src/uts/common/fs/nfs/nfs3_srv.c
    M usr/src/uts/common/fs/nfs/nfs_server.c
    M usr/src/uts/common/fs/nfs/nfs_srv.c

Log Message:
-----------
Back out hg changeset 829c00a55a37, bug 2986 -- introduces vulnerability

Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Approved by: Dan McDonald <danmcd@nexenta.com>
Updated by Rich Lowe about 10 years ago
- Status changed from In Progress to Closed