Project

General

Profile

Bug #3629

NFSv4 client: assertion failed: vp->v_filocks == NULL, file: ../../common/fs/vnode.c, line: 2429

Added by Denis Kozadaev over 7 years ago. Updated almost 4 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
nfs - NFS server and client
Start date:
2013-03-15
Due date:
% Done:

100%

Estimated time:
Difficulty:
Medium
Tags:
needs-triage
Gerrit CR:

Description

There is a kernel panic due to NFS i/o processing.
vmdump is available (probably, with debug info, unfortunately I don't remember it exactly)
some info from mdb:

witch# mdb -k unix.9 vmcore.9
Loading modules: [ unix genunix specfs dtrace mac cpu.generic cpu_ms.AuthenticAMD.15 uppc pcplusmp scsi_vhci zfs ip hook neti sockfs arp usba sd stmf stmf_sbd s1394 fctl lofs sata random nfs idm cpc fcip crypto sppp ufs logindmux ptm ipc ]
> ::status
debugging crash dump vmcore.9 (64-bit) from witch
operating system: 5.11 oi_dev (i86pc)
image uuid: 53214ce9-0ff3-48eb-a21e-edc36b25b361
panic message: assertion failed: vp->v_filocks == NULL, file: ../../common/fs/vnode.c, line: 2420
dump content: kernel pages only
> ::stack
vpanic()
assfail+0x89(fffffffffbfa3b78, fffffffffbfa3de0, 974)
vn_free+0x111(ffffff039d062300)
destroy_rnode4+0x66(ffffff01c3e00038)
rp4_addfree+0x16e(ffffff01c3e00038, ffffff01c8c58c68)
nfs4_inactive_otw+0x4f9(ffffff039d062300, ffffff01c8c58c68)
nfs4_inactive_thread+0xb5(ffffffa6649d6000)
thread_start+8()
> ::panicinfo
             cpu                0
          thread ffffff0005d13c40
         message assertion failed: vp->v_filocks == NULL, file: ../../common/fs/vnode.c, line: 2420
             rdi fffffffffbf87248
             rsi ffffff0005d138e0
             rdx fffffffffbfa3de0
             rcx ffffff0005d13920
              r8 ffffff01ce9407a9
              r9                0
             rax ffffff0005d13830
             rbx fffffffffbfa3de0
             rbp ffffff0005d13910
             r10 ffffff04d1d310a8
             r10 ffffff04d1d310a8
             r11                7
             r12 fffffffffbfa3b78
             r13              974
             r14 ffffffa6649d6148
             r15 ffffffa4d4c555b8
          fsbase                0
          gsbase fffffffffbc3e420
              ds               4b
              es               4b
              fs                0
              gs              1c3
          trapno                0
             err                0
             rip fffffffffb86b760
              cs               30
          rflags              246
             rsp ffffff0005d13828
              ss               38
          gdt_hi                0
          gdt_lo         e00001ef
          idt_hi                0
          idt_lo         d0000fff
             ldt                0
            task               70
             cr0         8005003b
             cr2          8046cbc
             cr3          5400000
             cr4              6f8

The file is usr/src/uts/common/fs/vnode.c:

void
vn_free(vnode_t *vp)
{
        ASSERT(vp->v_shrlocks == NULL);
        ASSERT(vp->v_filocks == NULL);
...


Related issues

Related to illumos gate - Bug #7721: Poor nfs4_frlock() performance for blocking locksIn ProgressMarcel Telka2017-01-03

Actions
#1

Updated by Marcel Telka over 7 years ago

  • Category set to nfs - NFS server and client

Could you please make the crash dump file available? Thanks.

#3

Updated by Marcel Telka over 7 years ago

This is the vnode that caused the panic (please note v_filocks):

> ffffff039d062300::print vnode_t
{
    v_lock = {
        _opaque = [ 0 ]
    }
    v_flag = 0x10002
    v_count = 0x1
    v_data = 0xffffff01c3e00038
    v_vfsp = 0xffffff01ad7d6930
    v_stream = 0
    v_type = 1 (VREG)
    v_rdev = 0
    v_vfsmountedhere = 0
    v_op = 0xffffff01a9697c40
    v_pages = 0
    v_filocks = 0xffffffef2b3b6700
    v_shrlocks = 0
    v_nbllock = {
        _opaque = [ 0 ]
    }
    v_cv = {
        _opaque = 0
    }
    v_locality = 0
    v_femhead = 0
    v_path = 0xffffff01a5cb9b28 "/home/andrej/foxeye/atest.html" 
    v_rdcnt = 0
    v_wrcnt = 0
    v_mmap_read = 0
    v_mmap_write = 0
    v_mpssdata = 0
    v_fopdata = 0
    v_vsd_lock = {
        _opaque = [ 0 ]
    }
    v_vsd = 0
    v_xattrdir = 0
    v_count_dnlc = 0
}
> 

The machine panicked at 2013 Mar 15 14:19:05:

> time/Y
time:
time:           2013 Mar 15 14:19:05    
>

There are three mounted nfs filesystems:

> ::fsinfo!grep nfs
ffffff01ad7d6760 nfs             /nfs
ffffff01ad7d6590 nfs             /home/denis
ffffff01ad7d6930 nfs             /home/andrej
>

and we have several recovery related messages logged around the panic time. This is for the /nfs:

> ffffff01ad7d6760::print vfs_t vfs_data|::print mntinfo4_t mi_msg_list|::nfs4_mimsg ! tail
[NFS4]2013 Mar 15 11:56:09: Starting recovery for mount /nfs (0xffffff04c70f9000, flags 0x22) on server fucker, rnode_pt1 <NULL> (0x0), rnode_pt2 <NULL> (0x0)
[NFS4]2013 Mar 15 11:56:09: Recovery done for mount /nfs (0xffffff04c70f9000) on server fucker, rnode_pt1 <NULL> (0x0), rnode_pt2 <NULL> (0x0)
[NFS4]2013 Mar 15 11:56:58: Starting recovery for mount /nfs (0xffffff04c70f9000, flags 0x22) on server fucker, rnode_pt1 <NULL> (0x0), rnode_pt2 <NULL> (0x0)
[NFS4]2013 Mar 15 11:56:58: Recovery done for mount /nfs (0xffffff04c70f9000) on server fucker, rnode_pt1 <NULL> (0x0), rnode_pt2 <NULL> (0x0)
[NFS4]2013 Mar 15 11:57:09: Starting recovery for mount /nfs (0xffffff04c70f9000, flags 0x22) on server fucker, rnode_pt1 <NULL> (0x0), rnode_pt2 <NULL> (0x0)
[NFS4]2013 Mar 15 11:57:09: Recovery done for mount /nfs (0xffffff04c70f9000) on server fucker, rnode_pt1 <NULL> (0x0), rnode_pt2 <NULL> (0x0)
[NFS4]2013 Mar 15 14:19:04: Starting recovery for mount /nfs (0xffffff04c70f9000, flags 0x22) on server fucker, rnode_pt1 <NULL> (0x0), rnode_pt2 <NULL> (0x0)
[NFS4]2013 Mar 15 14:19:04: Recovery done for mount /nfs (0xffffff04c70f9000) on server fucker, rnode_pt1 <NULL> (0x0), rnode_pt2 <NULL> (0x0)
[NFS4]2013 Mar 15 14:19:05: Starting recovery for mount /nfs (0xffffff04c70f9000, flags 0x22) on server fucker, rnode_pt1 <NULL> (0x0), rnode_pt2 <NULL> (0x0)
[NFS4]2013 Mar 15 14:19:05: Recovery done for mount /nfs (0xffffff04c70f9000) on server fucker, rnode_pt1 <NULL> (0x0), rnode_pt2 <NULL> (0x0)
>

This is for /home/denis:

> ffffff01ad7d6590::print vfs_t vfs_data|::print mntinfo4_t mi_msg_list|::nfs4_mimsg ! tail
[NFS4]2013 Mar 15 11:56:09: Starting recovery for mount /home/denis (0xffffffa20b73b000, flags 0x22) on server fucker, rnode_pt1 <NULL> (0x0), rnode_pt2 <NULL> (0x0)
[NFS4]2013 Mar 15 11:56:09: Recovery done for mount /home/denis (0xffffffa20b73b000) on server fucker, rnode_pt1 <NULL> (0x0), rnode_pt2 <NULL> (0x0)
[NFS4]2013 Mar 15 11:56:58: Starting recovery for mount /home/denis (0xffffffa20b73b000, flags 0x22) on server fucker, rnode_pt1 <NULL> (0x0), rnode_pt2 <NULL> (0x0)
[NFS4]2013 Mar 15 11:56:58: Recovery done for mount /home/denis (0xffffffa20b73b000) on server fucker, rnode_pt1 <NULL> (0x0), rnode_pt2 <NULL> (0x0)
[NFS4]2013 Mar 15 11:57:09: Starting recovery for mount /home/denis (0xffffffa20b73b000, flags 0x22) on server fucker, rnode_pt1 <NULL> (0x0), rnode_pt2 <NULL> (0x0)
[NFS4]2013 Mar 15 11:57:09: Recovery done for mount /home/denis (0xffffffa20b73b000) on server fucker, rnode_pt1 <NULL> (0x0), rnode_pt2 <NULL> (0x0)
[NFS4]2013 Mar 15 14:19:04: Starting recovery for mount /home/denis (0xffffffa20b73b000, flags 0x22) on server fucker, rnode_pt1 <NULL> (0x0), rnode_pt2 <NULL> (0x0)
[NFS4]2013 Mar 15 14:19:05: File ./.xsession-errors(rnode_pt ffffff01cee9c440) on server fucker could not be recovered and was closed.  Couldn't reopen: NFSv4 error NFS4ERR_OLD_STATEID.
[NFS4]2013 Mar 15 14:19:05: Op OP_OPEN got error NFS4ERR_OLD_STATEID causing recovery action NR_OLDSTATEID.
[NFS4]2013 Mar 15 14:19:05: Recovery done for mount /home/denis (0xffffffa20b73b000) on server fucker, rnode_pt1 <NULL> (0x0), rnode_pt2 <NULL> (0x0)
>

And this is for /home/andrej:

> ffffff01ad7d6930::print vfs_t vfs_data|::print mntinfo4_t mi_msg_list|::nfs4_mimsg ! tail
[NFS4]2013 Mar 15 11:57:09: Starting recovery for mount /home/andrej (0xffffffa6649d6000, flags 0x1) on server fucker, rnode_pt1 ./foxeye/foxeye.debug (0xffffff0499206030), rnode_pt2 <NULL> (0x0)
[NFS4]2013 Mar 15 11:57:10: Recovery done for mount /home/andrej (0xffffffa6649d6000) on server fucker, rnode_pt1 ./foxeye/foxeye.debug (0xffffff0499206030), rnode_pt2 <NULL> (0x0)
[NFS4]2013 Mar 15 14:19:04: Op OP_RENEW got error NFS4ERR_EXPIRED causing recovery action NR_CLIENTID.
[NFS4]2013 Mar 15 14:19:04: Starting recovery for mount /home/andrej (0xffffffa6649d6000, flags 0x1) on server fucker, rnode_pt1 <NULL> (0x0), rnode_pt2 <NULL> (0x0)
[NFS4]2013 Mar 15 14:19:04: Couldn't reclaim lock for pid 151483 for file ./foxeye/logs/ugolok (rnode_pt ffffff04c39fc400) on (server fucker): error 10033
[NFS4]2013 Mar 15 14:19:04: Process 151483 lost its locks on file ./foxeye/logs/ugolok (rnode_pt ffffff04c39fc400) due to a NFS error (10033) on server fucker
[NFS4]2013 Mar 15 14:19:05: File ./foxeye/atest.html(rnode_pt ffffff01c3e00038) on server fucker could not be recovered and was closed.  Couldn't reopen: NFSv4 error NFS4ERR_OLD_STATEID.
[NFS4]2013 Mar 15 14:19:05: Op OP_OPEN got error NFS4ERR_OLD_STATEID causing recovery action NR_OLDSTATEID.
[NFS4]2013 Mar 15 14:19:05: Op OP_OPEN got error NFS4ERR_EXPIRED causing recovery action NR_CLIENTID.
[NFS4]2013 Mar 15 14:19:05: Recovery done for mount /home/andrej (0xffffffa6649d6000) on server fucker, rnode_pt1 <NULL> (0x0), rnode_pt2 <NULL> (0x0)
>

From the log we see that the recovery for /home/andrej/foxeye/atest.html failed (the file was closed). Immediately after that we panicked.

#4

Updated by Piotr Jasiukajtis over 7 years ago

Denis Kozadaev wrote:

There is a kernel panic due to NFS i/o processing.
vmdump is available (probably, with debug info, unfortunately I don't remember it exactly)
some info from mdb:
witch# mdb -k unix.9 vmcore.9
Loading modules: [ unix genunix specfs dtrace mac cpu.generic cpu_ms.AuthenticAMD.15 uppc pcplusmp scsi_vhci zfs ip hook neti sockfs arp usba sd stmf stmf_sbd s1394 fctl lofs sata random nfs idm cpc fcip crypto sppp ufs logindmux ptm ipc ]

::status

debugging crash dump vmcore.9 (64-bit) from witch
operating system: 5.11 oi_dev (i86pc)

Denis, what is an illumos revision of this build?

#5

Updated by Denis Kozadaev over 7 years ago

Unfortunately, I can't say anything about the revision :(
I did "hg pull -u" many times, the source tree is shared via nfs.
The only I can say now is that I booted the system when it was built,
this monday I will say you the date of some system file (/kernel/genunix, for example)
this date will be the nearest of the revision date.
I'm so sorry, guys...

#6

Updated by Denis Kozadaev over 7 years ago

Hello, guys
the gate was built at 25 of feb.
-rwxr-xr-x 1 bin bin 3636544 Feb 25 17:41 genunix
take the latest revision before this date
and good luck

#7

Updated by Marcel Telka almost 4 years ago

  • Description updated (diff)
#8

Updated by Marcel Telka almost 4 years ago

  • Subject changed from kernel panic (openindiana) to NFSv4 client: assertion failed: vp->v_filocks == NULL, file: ../../common/fs/vnode.c, line: 2420
#9

Updated by Marcel Telka almost 4 years ago

Steps to reproduce

You need a debug build for the NFSv4 client (with compiled in asserts) to reproduce the panic.

  1. lock a file over NFSv4, leave the application running, so the file is opened and locked,
  2. remove the file locally at the NFS server,
  3. restart the NFS server (either svcadm restart nfs/server or reboot),
  4. once the NFS server is restarted the NFS client tries to reopen the file, but it fails (with the NFS4ERR_STALE error),
  5. try to terminate the locking application at the NFSv4 client, the system will panic.
> ::status
debugging crash dump vmcore.0 (64-bit) from t3
operating system: 5.11 il-nightly-master-0-g81293f9 (i86pc)
image uuid: c7393aa9-e737-4432-dbf6-d8db39fa9d4f
panic message: assertion failed: vp->v_filocks == NULL, file: ../../common/fs/vnode.c, line: 2429
dump content: kernel pages only
> ::stack
vpanic()
0xfffffffffbdf77e8()
vn_free+0x55(ffffff01a38a8300)
destroy_rnode4+0x7c(ffffff01a3c18be8)
rp4_addfree+0x2de(ffffff01a3c18be8, ffffff01ac2cf430)
nfs4_inactive_otw+0x479(ffffff01a38a8300, ffffff01ac2cf430)
nfs4_inactive_thread+0xde(ffffff01b16b5000)
thread_start+8()
>
#10

Updated by Marcel Telka almost 4 years ago

  • Related to Bug #7721: Poor nfs4_frlock() performance for blocking locks added
#11

Updated by Marcel Telka almost 4 years ago

  • Subject changed from NFSv4 client: assertion failed: vp->v_filocks == NULL, file: ../../common/fs/vnode.c, line: 2420 to NFSv4 client: assertion failed: vp->v_filocks == NULL, file: ../../common/fs/vnode.c, line: 2429
#12

Updated by Marcel Telka almost 4 years ago

  • Status changed from New to In Progress
  • Assignee set to Marcel Telka
#13

Updated by Marcel Telka almost 4 years ago

Root cause

In a case the file open recovery fails (so the R4RECOVERR is set for a particular rnode4_t) then the reclaim_one_lock() does not attempt to recover lost locks for such a file and returns success. This causes that the relock_file() leaks lost locks (in v_filocks).

The fix

The fix changes reclaim_one_lock() so it returns error (ESTALE) in such a case so the relock_file() have a chance to notice a problem and thus send SIGLOST to the particular process via nfs4_send_siglost(). Finally, the relock_skip_pid() is updated to unregister the lost locks.

#15

Updated by Marcel Telka almost 4 years ago

  • Status changed from In Progress to Pending RTI
#16

Updated by Electric Monk almost 4 years ago

  • Status changed from Pending RTI to Closed
  • % Done changed from 0 to 100

git commit faf39f174e3f2f426d74639ff702c5ba9f29f56b

commit  faf39f174e3f2f426d74639ff702c5ba9f29f56b
Author: Marcel Telka <marcel@telka.sk>
Date:   2017-02-27T17:15:57.000Z

    3629 NFSv4 client: assertion failed: vp->v_filocks == NULL, file: ../../common/fs/vnode.c, line: 2429
    Reviewed by: Arne Jansen <arne@die-jansens.de>
    Approved by: Robert Mustacchi <rm@joyent.com>

Also available in: Atom PDF