Bug #1927

nfs4 stale clientid loop (OI server, Linux client)

Added by Kasper Brink almost 9 years ago. Updated almost 8 years ago.

Status: Resolved
Priority: Low
Assignee:
Category: nfs - NFS server and client
Start date: 2011-12-23
Due date:
% Done: 100%
Estimated time:
Difficulty: Bite-size
Tags:
Gerrit CR:

Description

A filesystem mounted over nfs4 from an OI 151a server on a Linux client becomes unresponsive after running a filebench workload on the client for several minutes. Metadata operations (ls, stat, rm, mkdir) still work, but operations involving file contents (e.g. cat) block indefinitely. After restarting the nfs service on the server and waiting for 2 minutes, the nfs4 mount becomes accessible again.

Attached is a tcpdump trace starting about a second before the nfs4 mount stops responding. From packet 8158 onward, the client and server go into a loop: the client gets an NFS4ERR_STALE_CLIENTID on OPEN, successfully renews its clientid, and then immediately gets the same error on the next OPEN. This seems to be the same problem as described in http://thread.gmane.org/gmane.linux.nfs/44449 . The conclusion of that thread was that this is not a bug in the Linux client.
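
(Not part of the original report.) One way to inspect the attached trace around the point where the loop starts is with tshark; the display filter below is only a sketch, and older tshark versions use -R instead of -Y:

# show the NFS traffic from just before the loop starts (packet 8158 onward)
tshark -r nfs4err.pcap -Y 'frame.number >= 8150 && nfs'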

This behaviour happens both on physical hardware and in Xen PV domains. The problem does not occur when the client is running OI, or the server is running Linux, or over nfs3.
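
For comparison (not part of the original report), an NFSv3 mount of the same export, which per the above does not show the problem, would look something like this on the Linux client:

# mount the same share over NFSv3 instead of NFSv4
mount -t nfs -o vers=3,rw,sync,hard $SERVER:/temppool /mnt/temppool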

mdb output

Before running the filebench workload, "echo -e '::rfs4_db\n::rfs4_client' | mdb -k" (suggested by Albert Lee) shows:

rfs4_database=ffffff01513b4db0
  debug_flags=00000000   shutdown:      count=0 tables=ffffff01513c66c0
------------------ Table ------------------- Bkt  ------- Indices -------
Address          Name          Flags    Cnt  Cnt  Pointer          Cnt  Max
ffffff01513c66c0 DelegStateID  00000000 0000 2047 ffffff014e6d3d00 0002 0002
ffffff01513c67a0 File          00000000 0000 2047 ffffff014dc6a940 0001 0001
ffffff01513c6880 Lockowner     00000000 0000 2047 ffffff014fbc9400 0002 0002
ffffff01513c6960 LockStateID   00000000 0000 2047 ffffff014e50fec0 0002 0002
ffffff01513c6a40 OpenStateID   00000000 0000 2047 ffffff014bfcc380 0003 0003
ffffff01513c6b20 OpenOwner     00000000 0000 2047 ffffff0149d6fd00 0001 0001
ffffff01513c6c00 ClntIP        00000000 0000 2047 ffffff0149d6e440 0001 0001
ffffff01513c6ce0 Client        00000000 0000 2047 ffffff014e5af2c0 0002 0002
Address          dbe              clientid         confirm_verf     NCnfm unlnk
cp_confirmed     Last Access

When the filesystem hangs, I get:

rfs4_database=ffffff01513b4db0
  debug_flags=00000000   shutdown:      count=0 tables=ffffff01513c66c0
------------------ Table ------------------- Bkt  ------- Indices -------
Address          Name          Flags    Cnt  Cnt  Pointer          Cnt  Max
ffffff01513c66c0 DelegStateID  00000000 0330 2047 ffffff014e6d3d00 0002 0002
ffffff01513c67a0 File          00000000 120906 2047 ffffff014dc6a940 0001 0001
ffffff01513c6880 Lockowner     00000000 0000 2047 ffffff014fbc9400 0002 0002
ffffff01513c6960 LockStateID   00000000 0000 2047 ffffff014e50fec0 0002 0002
ffffff01513c6a40 OpenStateID   00000000 121966 2047 ffffff014bfcc380 0003 0003
ffffff01513c6b20 OpenOwner     00000000 1048576 2047 ffffff0149d6fd00 0001 0001
ffffff01513c6c00 ClntIP        00000000 0000 2047 ffffff0149d6e440 0001 0001
ffffff01513c6ce0 Client        00000000 0001 2047 ffffff014e5af2c0 0002 0002
Address          dbe              clientid         confirm_verf     NCnfm unlnk
cp_confirmed     Last Access
ffffff01527e7770 ffffff01527e7708 664ef4771b 6600000000000000 False False
0                2011 Dec 23 15:02:41
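
For reference, the table counts above come from the mdb invocation quoted earlier; it can be re-run at any point on the server to watch the counts grow (a sketch, run as root on the OI server):

# same command as quoted above; compare the Cnt columns before and after the hang
echo -e '::rfs4_db\n::rfs4_client' | mdb -k
# in the hung state the OpenOwner table's Cnt reaches 1048576, i.e. the table is full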

Steps to reproduce the problem

On server:

# (Ramdisk-based pool is fastest, but disk-based works too)
ramdiskadm -a tempdisk 256m
zpool create temppool /dev/ramdisk/tempdisk
zfs set sharenfs=rw=$CLIENT,root=$CLIENT temppool
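
The description above mentions restarting the nfs service on the server as a workaround; on OI this can be done through SMF, assuming the standard service FMRI:

# make sure the NFS server service is running
svcadm enable -r svc:/network/nfs/server:default
# once the mount hangs, restarting the service (and waiting ~2 minutes) makes it accessible again
svcadm restart svc:/network/nfs/server:default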

On client:

# I used Debian Squeeze (6.0.3), but I expect other distros will work as well.
# uname -a : Linux basil 2.6.32-5-xen-amd64 #1 SMP Mon Oct 3 07:53:54 UTC 2011 x86_64 GNU/Linux
# dpkg -l nfs-common :  nfs-common  1:1.2.2-4  NFS support files common to client and server

# Get filebench, either from distro, or download:
#  http://sourceforge.net/projects/filebench/files/filebench/filebench-1.4.9.1
#  untar; ./configure && make && make install

mkdir /mnt/temppool
mount -t nfs4 -o rw,sync,hard $SERVER:/temppool /mnt/temppool
# Check that filesystem is writeable
touch /mnt/temppool/foo

# Save nfs4test.f (attached) to a file

for i in $(seq 15); do echo ===== $i $(date); filebench -f nfs4test.f; done
# The NFS4 mount should become unresponsive around iteration 8 or 9...
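
(Not in the original steps.) A quick check that the hang matches the description, where metadata operations still work but data operations block, could look like this:

# metadata operations should still return promptly
ls -l /mnt/temppool
stat /mnt/temppool/foo
# a data operation blocks indefinitely (run this in a separate shell; on a hard
# mount it may not be interruptible)
cat /mnt/temppool/foo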


Files

nfs4test.f (379 Bytes) - Filebench workload to reproduce the issue - Kasper Brink, 2011-12-23 02:50 PM
nfs4err.pcap (2.62 MB) - tcpdump trace - Kasper Brink, 2011-12-23 02:50 PM

Related issues

Related to illumos gate - Bug #417: Stale OpenOwner entries are not reaped for active clients (Resolved, Vitaliy Gusev, 2010-11-15)

#1

Updated by William Dauchy almost 9 years ago

Just for information: we applied several fixes from Trond on the client side. They do not change anything on the server side, but they appear to fix the problem on the client:

http://git.linux-nfs.org/?p=trondmy/linux-nfs.git;a=commit;h=4b44b40e04a758e2242ff4a3f7c15982801ec8bc
http://git.linux-nfs.org/?p=trondmy/linux-nfs.git;a=commit;h=652f89f64fabcdae9143ee2b4253cfa838fb0279

#2

Updated by William Dauchy almost 9 years ago

Hm, spoke too fast. I was still able to reproduce the problem with filebench.

#3

Updated by Gordon Ross over 8 years ago

  • Status changed from New to Resolved
changeset:   13756:281242a86f07
tag:         tip
user:        Vitaliy Gusev <gusev.vitaliy@nexenta.com>
date:        Sun Jul 22 00:00:47 2012 +0000

description:
    417 Stale OpenOwner entries are not reaped for active clients
    Reviewed by: Albert Lee <trisk@nexenta.com>
    Approved by: Gordon Ross <gwr@nexenta.com>

modified:
   usr/src/uts/common/fs/nfs/nfs4_state.c

#4

Updated by Marcel Telka almost 8 years ago

  • Status changed from Resolved to In Progress
  • Assignee set to Marcel Telka
  • Priority changed from Normal to Low

#1927 is not the same issue as #417. It is a combination of two bugs:

  • the first is #417 (already fixed in illumos), which allowed the OpenOwner table to be exhausted
  • the second bug is still in the illumos NFS server and is not fixed yet. I think it is here:

http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/nfs/nfs4_srv.c#7340

When rfs4_findopenowner() fails (because the OpenOwner table is exhausted), the server should not return NFS4ERR_STALE_CLIENTID, but NFS4ERR_RESOURCE, or something like that.

It is true that with #417 fixed it is hard to reproduce #1927, but the bug is still there.

#5

Updated by Marcel Telka almost 8 years ago

  • % Done changed from 0 to 20
  • Difficulty changed from Medium to Bite-size
#6

Updated by Marcel Telka almost 8 years ago

  • Status changed from In Progress to Pending RTI
#7

Updated by Rich Lowe almost 8 years ago

  • Status changed from Pending RTI to Resolved
  • % Done changed from 20 to 100
  • Tags deleted (needs-triage)

Resolved in 11bb729
