Bug #3449

System panics after zfs rollback

Added by Greg Mason over 10 years ago. Updated about 5 years ago.

Status: Closed
Priority: Normal
Assignee:
Category: nfs - NFS server and client
Start date: 2013-01-04
Due date:
% Done: 0%
Estimated time:
Difficulty: Hard
Tags: needs-triage
Gerrit CR:
External Bug:

Description

With sharenfs enabled for a filesystem, I ran a ZFS rollback. The system promptly panicked. From the console:

panic[cpu26]/thread=ffffff4777493120: mutex_destroy: not owner, lp=ffffff4349dd03c8 owner=ffffff46e3fb0be0 thread=ffffff4777493120

ffffff01eb948870 unix:mutex_panic+73 ()
ffffff01eb9488a0 unix:mutex_destroy+60 ()
ffffff01eb9488d0 nfssrv:exportfree+f4 ()
ffffff01eb948900 nfssrv:exi_rele+60 ()
ffffff01eb948c20 nfssrv:common_dispatch+c5 ()
ffffff01eb948c40 nfssrv:rfs_dispatch+2d ()
ffffff01eb948d20 rpcmod:svc_getreq+1c1 ()
ffffff01eb948d90 rpcmod:svc_run+146 ()
ffffff01eb948dd0 rpcmod:svc_do_run+8e ()
ffffff01eb948ec0 nfs:nfssys+f1 ()
ffffff01eb948f10 unix:brand_sys_sysenter+1c9 ()

panic: entering debugger (continue to save dump)

Welcome to kmdb
kmdb: unable to determine terminal type: assuming `vt100'
Loaded modules: [ scsi_vhci stmf_sbd crypto mac cpc uppc neti sd ptm ufs unix 
zfs krtld s1394 apix uhci hook lofs genunix idm ip logindmux nsmb usba specfs 
pcplusmp nfs md random cpu.generic arp mpt_sas stmf sockfs smbsrv ]
[26]> :c
syncing file systems... done
dumping to /dev/zvol/dsk/rpool/dump, offset 65536, content: kernel

Since this initial panic, the system panics shortly after all ZFS filesystems are mounted.

From fmdump:

TIME                           UUID                                 SUNW-MSG-ID
Jan 04 2013 14:41:35.633638000 7034218a-9142-4785-a4dd-c5eb0fe9748c SUNOS-8000-KL

  TIME                 CLASS                                 ENA
  Jan 04 14:41:35.5759 ireport.os.sunos.panic.dump_available 0x0000000000000000

nvlist version: 0
        version = 0x0
        class = list.suspect
        uuid = 7034218a-9142-4785-a4dd-c5eb0fe9748c
        code = SUNOS-8000-KL
        diag-time = 1357328495 587499
        de = fmd:///module/software-diagnosis
        fault-list-sz = 0x1
        fault-list = (array of embedded nvlists)
        (start fault-list[0])
        nvlist version: 0
                version = 0x0
                class = defect.sunos.kernel.panic
                certainty = 0x64
                asru = sw:///:path=/var/crash/unknown/.7034218a-9142-4785-a4dd-c5eb0fe9748c
                resource = sw:///:path=/var/crash/unknown/.7034218a-9142-4785-a4dd-c5eb0fe9748c
                savecore-succcess = 1
                dump-dir = /var/crash/unknown
                dump-files = vmdump.1
                os-instance-uuid = 7034218a-9142-4785-a4dd-c5eb0fe9748c
                panicstr = BAD TRAP: type=e (#pf Page fault) rp=ffffff01ec1295c0 addr=108 occurred in module "unix" due to a NULL pointer dereference
                panicstack = unix:die+df () | unix:trap+db3 () | unix:cmntrap+e6 () | unix:mutex_enter+b () | nfssrv:rfs3_lookup+508 () | nfssrv:common_dispatch+539 () | nfssrv:rfs_dispatch+2d () | rpcmod:svc_getreq+1c1 () | rpcmod:svc_run+146 () | rpcmod:svc_do_run+8e () | nfs:nfssys+f1 () | unix:brand_sys_sysenter+1c9 () | 
                crashtime = 1357323719
                panic-time = Fri Jan  4 13:21:59 2013 EST
        (end fault-list[0])

        fault-status = 0x1
        severity = Major
        __ttl = 0x1
        __tod = 0x50e7306f 0x25c48c70

Core files: the first is from one of the first panics today, and the second is from the most recent panic.

http://share.hpcc.msu.edu/users/gmason/vmdump.0
http://share.hpcc.msu.edu/users/gmason/vmdump.1


Files

nfs-regress-fix.patch (1.09 KB), Vitaliy Gusev, 2013-01-05 04:04 PM

Related issues

Related to illumos gate - Bug #2986: nfs: exi refcounter leak at rfs3_lookup (Resolved, Marcel Telka, 2012-07-10)
Has duplicate illumos gate - Bug #3435: nfssrv causes bad mutex crash (Closed, Marcel Telka, 2012-12-31)
Has duplicate illumos gate - Bug #3472: nfs causing kernel panic (Closed, Marcel Telka, 2013-01-14)

Actions #1

Updated by Greg Mason over 10 years ago

In case it's of any use, here's another core:

http://share.hpcc.msu.edu/users/gmason/vmdump.2

Actions #2

Updated by Dan McDonald over 10 years ago

Thank you!

vmdump.1 proved useful.

Look at rfs3_lookup() for a case where exi (export information) is NULL. The function mostly seems to handle it, and it hits here:

    if (dvp == NULL) {
        error = ESTALE;
        goto out;
    }

If you look at out:

out:
    /*
     * The passed argument exportinfo is released by the
     * caller, common_dispatch
     */
    exi_rele(exi);

The comment seems to indicate that the caller would release the exi. It does, BUT there's an additional hold in some places thanks to a bugfix.

The bugfix in question is for #2986, which added this release to match an additional hold it takes in rfs3_lookup(). It overlooked the possibility of exi being NULL combined with a direct goto out. The straightforward (and more readable!) fix here is:

diff -r b151bd260b71 usr/src/uts/common/fs/nfs/nfs3_srv.c
--- a/usr/src/uts/common/fs/nfs/nfs3_srv.c    Thu Dec 13 11:29:00 2012 -0800
+++ b/usr/src/uts/common/fs/nfs/nfs3_srv.c    Fri Jan 04 16:53:23 2013 -0500
@@ -433,6 +433,7 @@ rfs3_lookup(LOOKUP3args *args, LOOKUP3re
         goto out1;
     }

+    /* Take an extra hold here in case of public file handle usage. */
     exi_hold(exi);

     /*
@@ -550,10 +551,14 @@ rfs3_lookup(LOOKUP3args *args, LOOKUP3re

 out:
     /*
-     * The passed argument exportinfo is released by the
-     * caller, common_dispatch
-     */
-    exi_rele(exi);
+     * The passed argument exportinfo is released here too, because of
+     * the additional hold taken earlier.
+     *
+     * There's another way of getting here, though, so we need a NULL
+     * exi check.
+     */
+    if (exi != NULL)
+        exi_rele(exi);

     if (curthread->t_flag & T_WOULDBLOCK) {
         curthread->t_flag &= ~T_WOULDBLOCK;
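
The hold/release pairing behind the patch above can be illustrated with a small userland sketch. This is only a toy model under stated assumptions: struct toy_export, toy_hold(), toy_rele(), toy_lookup() and TOY_ESTALE are hypothetical names invented for illustration, and none of this is the illumos code. It shows an early exit that skips the extra hold, and a release at out that is guarded against a NULL entry.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical error code for this sketch; stands in for ESTALE. */
#define TOY_ESTALE 1

/* Toy stand-in for an export entry. Not the illumos struct exportinfo. */
struct toy_export {
	int refcnt;	/* outstanding holds, including the caller's */
};

static void
toy_hold(struct toy_export *e)
{
	e->refcnt++;
}

static void
toy_rele(struct toy_export *e)
{
	/* Releasing NULL or an entry with no holds is a bug in the caller. */
	assert(e != NULL && e->refcnt > 0);
	e->refcnt--;
}

/*
 * Toy lookup. The caller already owns one hold on e (if e is not NULL).
 * The function takes one extra hold and drops it again at "out", so the
 * caller's count is unchanged on return.
 */
static int
toy_lookup(struct toy_export *e)
{
	int error = 0;

	if (e == NULL) {
		error = TOY_ESTALE;
		goto out;	/* leaves before the extra hold is taken */
	}

	toy_hold(e);		/* extra hold, dropped at out */
	/* ... the actual lookup work would go here ... */

out:
	if (e != NULL)		/* guard in the spirit of the patch above */
		toy_rele(e);
	return (error);
}
```

In this sketch, calling toy_lookup() on an entry whose count is 1 returns 0 and leaves the count at 1, while toy_lookup(NULL) returns TOY_ESTALE without touching any counter. Dropping the guard at out would make the NULL case trip the assertion in toy_rele() (or dereference NULL with assertions disabled), which is loosely analogous to the NULL pointer dereference reported in rfs3_lookup.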
Actions #3

Updated by Vitaliy Gusev over 10 years ago

  • Status changed from New to In Progress
  • Assignee set to Vitaliy Gusev

I'll take it.

Actions #4

Updated by Vitaliy Gusev over 10 years ago

Greg, could you check this patch?

Actions #5

Updated by Greg Mason over 10 years ago

We had the system panic again before I could get the patch in place, but that time there was no ESXi client in place. Only RHEL 6.1 NFSv3 clients had any exports mounted.

I've got the patch in place on the system, and it looks like it has fixed the issue. Once I got the patched nfssrv module in place, the panic/reboot loop stopped.

Here's a crash dump from the last panic the system had (immediately before I got the patch in place): http://share.hpcc.msu.edu/users/gmason/vmdump.4

Actions #6

Updated by Marcel Telka over 10 years ago

  • Category changed from zfs - Zettabyte File System to nfs - NFS server and client
Actions #7

Updated by Andrew Rokitka over 10 years ago

Is there any documentation on how to apply and build this patch?

Actions #8

Updated by Vitaliy Gusev over 10 years ago

The common way to apply it is "patch -p1 < {PATCH}".
Building is described in "HOW_TO_BUILD illumos".

Actions #9

Updated by Patrick Domack over 10 years ago

My system also seems to have stabilized after adding this patch.

Well, it stabilized for a little over a week, and now my issue is back. I'm not sure my issue was directly related to this bug, though the patch was supposed to fix it as well.

Actions #10

Updated by Rasmus Fauske over 10 years ago

Hi,

Is this patch included in a binary release? Or is there a plan for when it will be introduced?

Actions #11

Updated by Marcel Telka about 10 years ago

  • Assignee changed from Vitaliy Gusev to Marcel Telka

The fix for #2986 that introduced this issue was backed out from the illumos gate:

  Branch: refs/heads/master
  Home:   https://github.com/illumos/illumos-gate
  Commit: 596bc2391087577f88d3665a6fb36aba8f1f2003
      https://github.com/illumos/illumos-gate/commit/596bc2391087577f88d3665a6fb36aba8f1f2003
  Author: Marcel Telka <marcel.telka@nexenta.com>
  Date:   2013-03-14 (Thu, 14 Mar 2013)

  Changed paths:
    M usr/src/uts/common/fs/nfs/nfs3_srv.c
    M usr/src/uts/common/fs/nfs/nfs_server.c
    M usr/src/uts/common/fs/nfs/nfs_srv.c

  Log Message:
  -----------
  Back out hg changeset 829c00a55a37, bug 2986  --  introduces vulnerability
  Reviewed by: Richard Lowe <richlowe@richlowe.net>
  Reviewed by: Christopher Siden <christopher.siden@delphix.com>
  Approved by: Dan McDonald <danmcd@nexenta.com>
Actions #12

Updated by Rich Lowe about 10 years ago

  • Status changed from In Progress to Closed