Bug #4517 (open)

Locking the same file over NFS from a Linux client by multiple processes can cause delays that accumulate in 30-second increments if the NFS server machine has more than one IPv4 interface.

Added by Youzhong Yang almost 10 years ago. Updated about 2 years ago.

Status: New
Priority: Normal
Assignee: -
Category: nfs - NFS server and client
Start date: 2014-01-23
Due date: -
% Done: 0%
Estimated time: -
Difficulty: Medium
Tags: needs-triage
Gerrit CR: -
External Bug: -

Description

This issue happens between a Linux client and a Solaris/SmartOS/OmniOS server, even when Sun's proprietary lockd is used. A Mac OS client works fine with a Solaris/SmartOS/OmniOS server. I am wondering whether anything can be done now that we are using the open source nlockmgr.

Here is how I reproduced the issue:
-------------------------------------------------------
The SmartOS server has the following settings:

hostname-nfs0 with IP address 172.30.192.194
hostname-nfs1 with IP address 172.30.192.191

ifconfig -a

lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
inet 127.0.0.1 netmask ff000000
aggr0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
inet 172.30.192.220 netmask ffffff00 broadcast 172.30.192.255
ether 0:1b:21:d4:84:44
admin0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> metric 2 mtu 1500 index 3
inet 172.30.192.194 netmask ffffff00 broadcast 172.30.192.255
ether 2:8:20:40:29:c1
admin1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> metric 2 mtu 1500 index 4
inet 172.30.192.191 netmask ffffff00 broadcast 172.30.192.255
ether 2:8:20:5a:f9:e
lo0: flags=2002000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv6,VIRTUAL> mtu 8252 index 1
inet6 ::1/128

On a Linux host (with IP address 172.28.146.50):
hostname-nfs0:/vmgr/Atnas4 is mounted at /mnt/Atnas4
hostname-nfs1:/vmgr/Atnas5 is mounted at /mnt/Atnas5
/mnt/Atnas4/scratch/file and /mnt/Atnas5/scratch/file are existing files writable by the user.

Then run the following commands on the Linux host (the perl script "lockfile.pl" is attached; a minimal sketch of what it presumably does follows the output below):

perl lockfile.pl 3 /mnt/Atnas4/scratch/file 2

INFO: Successfully forked 8316
INFO: Successfully forked 8317
INFO: Successfully forked 8318
INFO: pid 8318 successfully locked file 1091 times in 2 seconds
INFO: pid 8317 successfully locked file 1116 times in 2 seconds
INFO: pid 8316 successfully locked file 1121 times in 2 seconds

perl lockfile.pl 3 /mnt/Atnas5/scratch/file 2

INFO: Successfully forked 8411
INFO: Successfully forked 8412
INFO: Successfully forked 8413
INFO: pid 8411 successfully locked file 1 times in 30 seconds
INFO: pid 8412 successfully locked file 1 times in 60 seconds
INFO: pid 8413 successfully locked file 2 times in 90 seconds
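
For reference, here is a minimal sketch of what a script like the attached lockfile.pl presumably does: fork the requested number of processes and have each one repeatedly take and release an exclusive lock on the file for the given number of seconds. The real script is the attached lockfile.pl; the details below, including the use of flock rather than fcntl POSIX locks, are assumptions.

#!/usr/bin/perl
# Minimal sketch of a lock-contention test in the spirit of lockfile.pl.
# Usage: lockfile-sketch.pl <nprocs> <file> <seconds>
use strict;
use warnings;
use Fcntl qw(:flock);

my ($nprocs, $file, $secs) = @ARGV;
die "usage: $0 nprocs file seconds\n" unless $nprocs && $file && $secs;

for (1 .. $nprocs) {
    my $pid = fork();
    die "fork failed: $!\n" unless defined $pid;
    if ($pid) {
        print "INFO: Successfully forked $pid\n";
        next;
    }
    # Child: take and drop an exclusive whole-file lock in a loop until
    # the time budget runs out, then report how many locks succeeded.
    open(my $fh, '+<', $file) or die "open $file: $!\n";
    my $start = time();
    my $count = 0;
    while (time() - $start < $secs) {
        flock($fh, LOCK_EX) or die "flock: $!\n";
        $count++;
        flock($fh, LOCK_UN);
    }
    printf "INFO: pid %d successfully locked file %d times in %d seconds\n",
        $$, $count, time() - $start;
    exit 0;
}
wait() for 1 .. $nprocs;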

The attached tcpdump file (atnas5.pcap) has the packets related to the last command. Frames 1 through 94 are all between the Linux host (172.28.146.50) and hostname-nfs1 (172.30.192.191), but starting with frame 95, hostname-nfs0 (172.30.192.194) is involved. I remember a Linux bug was filed for this issue (sorry, I cannot find it now), but the Linux developers declined to fix it, stating that the Solaris/SmartOS server should not switch to another IP to communicate with the Linux client, and that accepting such traffic could cause a security problem.


Files

atnas5.pcap (25.3 KB) - Youzhong Yang, 2014-01-23 05:16 PM
lockfile.pl (4.54 KB) - Youzhong Yang, 2014-01-23 05:16 PM
#1

Updated by Youzhong Yang almost 10 years ago

Could someone take a look at this issue? If I could be given some pointers as to which source files to look at, that would also be very helpful. Thanks.

#3

Updated by Chip Schweiss almost 10 years ago

This bug became a showstopper for my NFSv3 clients. Users whose home directories were mounted over NFSv3 could not log in more than once with csh, as the .history file would get caught up in this.

I have been forced to move my clients to NFSv4.

#4

Updated by Tobias Oetiker over 8 years ago

Is anyone looking into this? One of our customers just ran into this with OmniOS r014.

#5

Updated by Gordon Ross about 2 years ago

The github/Nexenta gate has a fix that makes the lock manager initiate outbound traffic with a local binding matching the address the inbound traffic came in on. That will probably fix this. See commit 7f413829f2d8f72678172516a1f5d358a3f9d299.
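
For illustration of the technique only (the actual fix is in the kernel lock manager, not user code): the general idea is to bind the outbound socket to the local address the client's request arrived on before connecting back, so the client never sees lock-manager traffic from an unexpected server address. A user-level Perl sketch follows; the addresses mirror the ones in this report, and the port is a placeholder.

#!/usr/bin/perl
# Illustrative only: bind the outgoing connection to a chosen local address
# before connecting, so the peer sees the address it was already talking to.
# The addresses mirror this report; the port is a placeholder.
use strict;
use warnings;
use IO::Socket::INET;

my $sock = IO::Socket::INET->new(
    LocalAddr => '172.30.192.191',  # interface the client's request came in on
    PeerAddr  => '172.28.146.50',   # the NLM client
    PeerPort  => 4045,              # placeholder port for the client's lock daemon
    Proto     => 'tcp',
) or die "connect: $!\n";

printf "outbound traffic originates from %s, as the client expects\n",
    $sock->sockhost();
close($sock);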
