Bug #4484

lockd: SMF property reading and cli options are incorrect

Added by Youzhong Yang over 6 years ago. Updated over 6 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
nfs - NFS server and client
Start date:
2014-01-15
Due date:
% Done:
100%
Estimated time:
Difficulty:
Medium
Tags:
Gerrit CR:

Description

This appears to be an issue with the open source NLM (nfs/lockd+klm). The issue is not reproducible when Sun's proprietary lockd is used.

Here is how to reproduce the issue:

1. On the client (Linux or SmartOS/OmniOS) side, mount a share from the server and make sure a writable file (say /mnt/share/file) exists on the share.

2. Using the attached perl script, run 'perl locktest.pl 60 /mnt/share/file 10'. It will fork 60 processes, each of which repeatedly tries to lock the given file, for at most 10 seconds (a rough C sketch of this behaviour follows).
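
For reference, here is a rough C sketch of what such a test does. The attached reproducer is a Perl script, so this is only an approximation of its behaviour, not a translation of it; the names and defaults below are made up for illustration:

/*
 * Rough sketch of the reproducer: fork N workers that repeatedly
 * open, lock and close the target file for a number of seconds and
 * report how often locking failed (e.g. with ENOLCK, "No record
 * locks available").
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
    int nproc = argc > 3 ? atoi(argv[1]) : 60;
    const char *path = argc > 3 ? argv[2] : "/mnt/share/file";
    int secs = argc > 3 ? atoi(argv[3]) : 10;
    int i;

    for (i = 0; i < nproc; i++) {
        if (fork() != 0)
            continue;    /* parent keeps forking workers */

        /* Worker: lock and unlock (via close) in a tight loop. */
        time_t end = time(NULL) + secs;
        long locked = 0, failed = 0;

        while (time(NULL) < end) {
            struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
            int fd = open(path, O_RDWR);

            if (fd < 0)
                break;
            if (fcntl(fd, F_SETLKW, &fl) == 0)
                locked++;
            else
                failed++;    /* e.g. ENOLCK */
            (void) close(fd);    /* closing the file also drops the lock */
        }
        (void) printf("pid %ld: %ld locks, %ld failures\n",
            (long)getpid(), locked, failed);
        _exit(0);
    }

    while (wait(NULL) > 0)
        ;
    return (0);
}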

I did some dtracing; it seems that 'no lock available' is caused by a thread reservation failure (svc_reserve_thread()):

http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/klm/nlm_service.c#543


Files

lockfile.pl (4.51 KB) - Youzhong Yang, 2014-01-15 07:53 PM

History

#1

Updated by Marcel Telka over 6 years ago

  • Status changed from New to In Progress
  • Assignee set to Marcel Telka

Please try to run this at the NFS server:

$ sharectl get -p lockd_servers nfs

If you see less than, let's say, 80, then please increase the value to 80.

#2

Updated by Youzhong Yang over 6 years ago

I did 'sharectl set -p lockd_servers=80 nfs', but the issue is still reproducible. I even tried 640, no luck.

#3

Updated by Marcel Telka over 6 years ago

Are you sure the lockd process has been restarted after you increased lockd_servers?

#4

Updated by Youzhong Yang over 6 years ago

Yes, I confirmed through 'pgrep -fl lockd'. Thanks.

#5

Updated by Marcel Telka over 6 years ago

The problem is that lockd ignores the lockd_servers setting:

# sharectl set -p lockd_servers=80 nfs
# svcadm restart svc:/network/nfs/nlockmgr:default
# echo ::svc_pool nlm | mdb -k
SVCPOOL = ffffff01d880a200 -> POOL ID = NLM(2)
Non detached threads    = 0
Detached threads        = 0
Max threads             = 20
`redline'               = 1
Reserved threads        = 0
Thread lock     = mutex not held
Asleep threads          = 0
Request lock    = mutex not held
Pending requests        = 0
Walking threads         = 0
Max requests from xprt  = 8
Stack size for svc_run  = 0
Creator lock    = mutex not held
No of Master xprt's     = 2
rwlock for the mxprtlist= owner 0   
master xprt list ptr    = ffffff01d7c45298

#

Youzhong, please confirm you see Max threads = 20 on your machine too after you increase lockd_servers. Thanks.

#6

Updated by Youzhong Yang over 6 years ago

Yes, I saw Max threads = 20 even though lockd_servers is set to 640. Thanks.

[root@batfs5900 ~]# sharectl get nfs
servers=512
lockd_listen_backlog=32
lockd_servers=640
lockd_retransmit_timeout=5
grace_period=90
server_versmin=2
server_versmax=3
client_versmin=2
client_versmax=4
server_delegation=on
nfsmapid_domain=
max_connections=-1
protocol=ALL
listen_backlog=32
device=
[root@batfs5900 ~]# echo ::svc_pool nlm | mdb -k
SVCPOOL = ffffff04f6b1c5a8 -> POOL ID = NLM(2)
Non detached threads    = 0
Detached threads        = 0
Max threads             = 20
`redline'               = 1
Reserved threads        = 0
Thread lock     = mutex not held
Asleep threads          = 0
Request lock    = mutex not held
Pending requests        = 0
Walking threads         = 0
Max requests from xprt  = 8
Stack size for svc_run  = 0
Creator lock    = mutex not held
No of Master xprt's     = 2
rwlock for the mxprtlist= owner 0
master xprt list ptr    = ffffff04f6a22900

#7

Updated by Marcel Telka over 6 years ago

Please try this as a workaround:

# svccfg -s svc:/network/nfs/nlockmgr:default setprop nfs-props/max_servers = integer: 80
# svcadm refresh svc:/network/nfs/nlockmgr:default
# svcadm restart svc:/network/nfs/nlockmgr:default
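
After the refresh and restart, the effect can be verified with the same check as above, i.e. 'echo ::svc_pool nlm | mdb -k'; the pool should now report Max threads = 80.
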
#8

Updated by Marcel Telka over 6 years ago

  • Subject changed from "Trying to lock the same file over NFS by a bunch of client processes could easily return ENOLCK (No record locks available)" to "lockd: SMF property reading is incorrect"

#9

Updated by Youzhong Yang over 6 years ago

It works. Thanks.

#10

Updated by Chip Schweiss over 6 years ago

I found this bug report because I am experiencing the same problem on OmniOS r151008j.

The 'Max threads' value was not getting set correctly by sharectl. Setting it with svccfg did update the value, but file locking is still not working properly.

The perl script attached by Youzhong Yang still cannot get any locks when multiple threads are attempting to get a lock. When run against an OmniOS 151006p installation or OpenIndiana 151a7, each thread will get one lock before finishing.

If the script is run against an NFS v4 mount on any of my ZFS servers, it gets hundreds of locks in each thread:

# perl ~/lockfile.pl 10 lock-test.txt 10
INFO: Successfully forked 187875
INFO: Successfully forked 187876
INFO: Successfully forked 187877
INFO: Successfully forked 187878
INFO: Successfully forked 187879
INFO: Successfully forked 187880
INFO: Successfully forked 187881
INFO: Successfully forked 187882
INFO: Successfully forked 187883
INFO: Successfully forked 187884
INFO: pid 187879 successfully locked file 1530 times in 10 seconds
INFO: pid 187876 successfully locked file 155 times in 10 seconds
INFO: pid 187882 successfully locked file 919 times in 10 seconds
INFO: pid 187880 successfully locked file 133 times in 10 seconds
INFO: pid 187883 successfully locked file 609 times in 10 seconds
INFO: pid 187881 successfully locked file 833 times in 10 seconds
INFO: pid 187884 successfully locked file 812 times in 10 seconds
INFO: pid 187878 successfully locked file 50 times in 10 seconds
INFO: pid 187875 successfully locked file 85 times in 11 seconds
INFO: pid 187877 successfully locked file 1137 times in 12 seconds

If I'm reading the script correctly, the behavior under NFS v4 is the only correct one, because immediately after locking the file the script closes it, which should release the lock. Against a local file system each thread locks thousands of times, leading me to believe it is broken in NFS v3.

The script never explicitly releases the lock, so closing the file may not be properly releasing it. This may be an inherent difference between NFS v3 and v4.
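
For what it is worth, under POSIX record-locking semantics closing a file descriptor does drop all locks the process holds on that file, so the close() in the script is expected to release the lock. A hypothetical standalone demo of that behaviour (not the attached script; the file name and the sleep-based timing are arbitrary):

/*
 * The parent takes a write lock and then closes the file; because the
 * close releases the parent's record lock, the child (a separate
 * process, hence a separate lock owner) can acquire it right away.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "/tmp/lockdemo";
    struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
    int fd = open(path, O_RDWR | O_CREAT, 0644);

    if (fd < 0 || fcntl(fd, F_SETLK, &fl) != 0) {
        perror("parent lock");
        return (1);
    }

    if (fork() == 0) {
        struct flock cfl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
        int cfd = open(path, O_RDWR);

        if (cfd < 0)
            _exit(1);
        sleep(1);    /* crude: give the parent time to close the file */
        if (fcntl(cfd, F_SETLK, &cfl) == 0)
            (void) printf("child got the lock: the close released it\n");
        else
            perror("child lock");
        _exit(0);
    }

    (void) close(fd);    /* releases the parent's record lock */
    (void) wait(NULL);
    return (0);
}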

#11

Updated by Marcel Telka over 6 years ago

I tested the script with NFSv3 and it worked as expected (IIRC). Maybe you are seeing some other bug. In any case, your issue is a separate problem, not directly related to the SMF property reading issue. Please file a separate bug report to cover it.

Thanks.

#12

Updated by Chip Schweiss over 6 years ago

I have isolated my issue. It turns out a recent change has made a difference in the NLM communication, which triggered an issue with the connection tracking in the Linux firewall: it started blocking connections back from the server.

I'm running CentOS 6.5 on the client side. I don't know how widespread this problem is with Linux.

The solution was to explicitly open TCP port 111 and the TCP high ports to the NFS server on the NFS client side.

My issue turns out not to be an Illumos bug.

#13

Updated by Marcel Telka over 6 years ago

  • Status changed from In Progress to Pending RTI

#14

Updated by Marcel Telka over 6 years ago

This bug fixes all of the issues related to the SMF property reading, the default values of the parameters, and the command line options, namely (see the sketch after the list):

  • The "listen_backlog" property name was changed to "lockd_listen_backlog" during the SMF read. The "lockd_listen_backlog" is the proper name for this property.
  • Similarly, "max_servers" was changed to "lockd_servers".
  • Similarly, "retrans_timeout" was changed to "lockd_retransmit_timeout".
  • The default value of "grace" is changed from 60 to 90; the "real" default value is (and always was) 90. See:

http://src.illumos.org/source/xref/illumos-gate/usr/src/cmd/fs.d/nfs/svc/nlockmgr.xml#95
http://src.illumos.org/source/xref/illumos-gate/usr/src/lib/libshare/nfs/libshare_nfs.c#2350
http://docs.oracle.com/cd/E23824_01/html/821-1462/lockd-1m.html

  • Similarly, "retransmittimeout" is changed from 15 to 5. The reason for this change is same as for the "grace" (see above).
  • In a case the SMF property reading fails (IOW, nfs_smf_get_iprop() fails), this is logged via syslog().
  • The legacy closed source lockd used to use -t command line option to override the "lockd_retransmit_timeout". The current lockd used -r for that. This inconsistency was fixed and -t is used now for overriding the "lockd_retransmit_timeout".
  • Support for setting "conn_idle_timeout" (both via SMF and via command line) was removed. The legacy closed source lockd had no support for setting this and this property is not defined in nlockmgr.xml. There is also some potential future conflict with this SMF property name.
  • The reading of "max_connections" SMF property was removed. The legacy closed source lockd had no support for setting this and this property is not defined in nlockmgr.xml. There is also some potential future conflict with this SMF property name. The support for setting "max_connections" using -c command line option was retained.
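
As a rough illustration of the corrected behaviour described above, the reads now amount to something like the sketch below: look up each value under its proper property name, fall back to a default, and complain via syslog() when the read fails. This is not the actual lockd code (which goes through nfs_smf_get_iprop()); it uses the generic libscf "simple" interface instead, and the "nfs-props" property group name is taken from the workaround in note #7.

/*
 * Illustrative sketch only; lockd itself reads these values through
 * nfs_smf_get_iprop().  Compile against libscf (-lscf).
 */
#include <libscf.h>
#include <stdio.h>
#include <syslog.h>

#define NLOCKMGR_FMRI   "svc:/network/nfs/nlockmgr:default"

/*
 * Read an integer property of the nlockmgr instance, falling back to
 * the supplied default and logging the failure, as the fix does.
 */
static int
get_nlm_iprop(const char *propname, int defval)
{
    scf_simple_prop_t *prop;
    int64_t *val;
    int ret = defval;

    prop = scf_simple_prop_get(NULL, NLOCKMGR_FMRI, "nfs-props", propname);
    if (prop != NULL && (val = scf_simple_prop_next_integer(prop)) != NULL)
        ret = (int)*val;
    else
        syslog(LOG_ERR, "Reading of %s failed, using default %d",
            propname, defval);
    if (prop != NULL)
        scf_simple_prop_free(prop);

    return (ret);
}

int
main(void)
{
    /* The defaults here are the ones quoted in the list above. */
    int grace = get_nlm_iprop("grace_period", 90);
    int retrans = get_nlm_iprop("lockd_retransmit_timeout", 5);

    (void) printf("grace=%d retransmit_timeout=%d\n", grace, retrans);
    return (0);
}
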
#15

Updated by Robert Mustacchi over 6 years ago

  • Status changed from Pending RTI to Resolved
  • % Done changed from 0 to 100

#16

Updated by Robert Mustacchi over 6 years ago

  • Status changed from Resolved to Pending RTI

#17

Updated by Robert Mustacchi over 6 years ago

  • Subject changed from "lockd: SMF property reading is incorrect" to "lockd: SMF property reading and cli options are incorrect"
  • Tags deleted (needs-triage)

#18

Updated by Robert Mustacchi over 6 years ago

  • Status changed from Pending RTI to Resolved

#19

Updated by Electric Monk over 6 years ago

git commit 42cdb25977ffa6ebef76064b66ad665992fadee5

Author: Marcel Telka <marcel.telka@nexenta.com>

4484 lockd: SMF property reading and cli options are incorrect
Reviewed by: Gary Mills <gary_mills@fastmail.fm>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: T Nguyen <truongqnguien@gmail.com>
Reviewed by: Jan Kryl <jan.kryl@nexenta.com>
Approved by: Robert Mustacchi <rm@joyent.com>
