Bug #4518

lockd: Cannot establish NLM service over <file desc. 9, protocol udp>

Added by Youzhong Yang over 5 years ago. Updated over 4 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
nfs - NFS server and client
Start date:
2014-01-23
Due date:
% Done:

100%

Estimated time:
Difficulty:
Medium
Tags:
needs-triage

Description

This error was observed when the server machine was rebooted. nlockmgr failed to start, but it succeeded when tried again.

/var/adm/messages =>

Jan 18 16:31:55 batfs9920 klmmod: [ID 814159 kern.notice] NOTICE: Failed to connect to local statd (rpcerr=5)
Jan 18 16:31:55 batfs9920 /usr/lib/nfs/lockd[3527]: [ID 491006 daemon.error] Cannot establish NLM service over <file desc. 9, protocol udp> : I/O error. Exiting
Jan 18 16:31:55 batfs9920 svc.startd[8]: [ID 652011 daemon.warning] svc:/network/nfs/nlockmgr:default: Method "/lib/svc/method/nlockmgr" failed with exit status 1.

/var/svc/log/network-nfs-nlockmgr:default.log =>

[ Jan 18 16:31:09 Executing start method ("/lib/svc/method/nlockmgr"). ]
[ Jan 18 16:31:55 Method "start" exited with status 1. ]
[ Jan 18 16:31:55 Executing start method ("/lib/svc/method/nlockmgr"). ]
[ Jan 18 16:31:57 Method "start" exited with status 0. ]


Related issues

Related to illumos gate - Bug #7231: nlockmgr failing to start up during bootup (Closed, 2016-07-28)

History

#1

Updated by Marcel Telka over 5 years ago

Are you able to reproduce this issue?

#2

Updated by Igor Kozhukhov over 5 years ago

Marcel Telka wrote:

Are you able to reproduce this issue?

I have the same issue.
It's easy to reproduce by creating some datasets on a different pool with some nested datasets, to delay the 'mount all' operation.
I have about 300 mounts with NFS sharing enabled on datasets.

#3

Updated by Robin Bowes over 5 years ago

I can report pretty much exactly the same error on SmartOS (joyent_20140124T065835Z):

2014-01-28T23:09:14.047710+00:00 nas02 klmmod: [ID 814159 kern.notice] NOTICE: Failed to connect to local statd (rpcerr=5)#012
2014-01-28T23:09:14+00:00 nas02 /usr/lib/nfs/lockd[2294]: [ID 491006 daemon.error] Cannot establish NLM service over <file desc. 9, protocol udp> : I/O error. Exiting
2014-01-28T23:09:14+00:00 nas02 svc.startd[8]: [ID 652011 daemon.warning] svc:/network/nfs/nlockmgr:default: Method "/lib/svc/method/nlockmgr" failed with exit status 1.

#4

Updated by Marcus Woods over 5 years ago

Marcel Telka wrote:

Are you able to reproduce this issue?

I have this same issue.

[ Apr 20 20:28:17 Executing start method ("/lib/svc/method/nlockmgr"). ]
[ Apr 20 20:29:07 Method "start" exited with status 1. ]
[ Apr 20 20:29:07 Executing start method ("/lib/svc/method/nlockmgr"). ]
[ Apr 20 20:29:09 Method "start" exited with status 0. ]
[ Apr 20 20:58:57 Enabled. ]
[ Apr 20 21:08:52 Enabled. ]
[ Apr 20 21:09:11 Executing start method ("/lib/svc/method/nlockmgr"). ]
[ Apr 20 21:09:56 Method "start" exited with status 1. ]
[ Apr 20 21:09:56 Executing start method ("/lib/svc/method/nlockmgr"). ]
[ Apr 20 21:09:59 Method "start" exited with status 0. ]

#5

Updated by Mr. Jester over 5 years ago

Very similar issue here. On boot, I get these errors, and after a few minutes, 'svcadm clear svc:/network/nfs/nlockmgr:default' brings everything up properly.

root@fs0:~# cat /etc/release
OmniOS v11 r151010

May 12 03:09:50 fs0 klmmod: [ID 432453 kern.notice] NOTICE: Failed to initialize NSM handler (error=2)
May 12 03:09:50 fs0 /usr/lib/nfs/lockd[583]: [ID 491006 daemon.error] Cannot establish NLM service over <file desc. 9, protocol udp> : No such file or directory. Exiting
May 12 03:09:50 fs0 svc.startd[10]: [ID 652011 daemon.warning] svc:/network/nfs/nlockmgr:default: Method "/lib/svc/method/nlockmgr" failed with exit status 1.
May 12 03:10:40 fs0 klmmod: [ID 432453 kern.notice] NOTICE: Failed to initialize NSM handler (error=2)
May 12 03:10:40 fs0 /usr/lib/nfs/lockd[756]: [ID 491006 daemon.error] Cannot establish NLM service over <file desc. 9, protocol udp> : No such file or directory. Exiting
May 12 03:10:40 fs0 svc.startd[10]: [ID 652011 daemon.warning] svc:/network/nfs/nlockmgr:default: Method "/lib/svc/method/nlockmgr" failed with exit status 1.
May 12 03:11:30 fs0 klmmod: [ID 432453 kern.notice] NOTICE: Failed to initialize NSM handler (error=2)
May 12 03:11:30 fs0 /usr/lib/nfs/lockd[782]: [ID 491006 daemon.error] Cannot establish NLM service over <file desc. 9, protocol udp> : No such file or directory. Exiting
May 12 03:11:30 fs0 svc.startd[10]: [ID 652011 daemon.warning] svc:/network/nfs/nlockmgr:default: Method "/lib/svc/method/nlockmgr" failed with exit status 1.
May 12 03:11:30 fs0 svc.startd[10]: [ID 748625 daemon.error] network/nfs/nlockmgr:default failed: transitioned to maintenance (see 'svcs -xv' for details)
May 12 03:11:30 fs0 fmd: [ID 377184 daemon.error] SUNW-MSG-ID: SMF-8000-YX, TYPE: defect, VER: 1, SEVERITY: major
May 12 03:11:30 fs0 EVENT-TIME: Mon May 12 03:11:30 UTC 2014
May 12 03:11:30 fs0 PLATFORM: VMware-Virtual-Platform, CSN: VMware-56-4d-93-dd-cc-cb-b7-51-aa-13-75-1c-26-27-b1-10, HOSTNAME: fs0
May 12 03:11:30 fs0 SOURCE: software-diagnosis, REV: 0.1
May 12 03:11:30 fs0 EVENT-ID: 0ecec43e-1796-e1a9-c56d-b074484b06f0
May 12 03:11:30 fs0 DESC: A service failed - a method is failing in a retryable manner but too often.
May 12 03:11:30 fs0 Refer to http://illumos.org/msg/SMF-8000-YX for more information.
May 12 03:11:30 fs0 AUTO-RESPONSE: The service has been placed into the maintenance state.
May 12 03:11:30 fs0 IMPACT: svc:/network/nfs/nlockmgr:default is unavailable.
May 12 03:11:30 fs0 REC-ACTION: Run 'svcs -xv svc:/network/nfs/nlockmgr:default' to determine the generic reason why the service failed, the location of any logfiles, and a list of other services impacted.

#6

Updated by Marcel Telka about 5 years ago

  • Assignee set to Marcel Telka

Is somebody able to reproduce this at will?
If so, can I get access to such a machine?
Or are there some steps to reproduce it anywhere?

Thanks.

#7

Updated by Marcel Telka about 5 years ago

  • Status changed from New to Feedback
#8

Updated by Marcel Telka about 5 years ago

Igor Kozhukhov wrote:

Marcel Telka wrote:

Are you able to reproduce this issue?

I have the same issue.
It's easy to reproduce by creating some datasets on a different pool with some nested datasets, to delay the 'mount all' operation.
I have about 300 mounts with NFS sharing enabled on datasets.

I tried to create > 1100 nested datasets with sharenfs=on, then I rebooted the machine and lockd started without any problem.

#9

Updated by Mr. Jester about 5 years ago

It occurs every time I reboot my server. We can work out some access to the system.

#10

Updated by Igor Kozhukhov about 5 years ago

Marcel Telka wrote:

Igor Kozhukhov wrote:

Marcel Telka wrote:

Are you able to reproduce this issue?

I have the same issue.
It's easy to reproduce by creating some datasets on a different pool with some nested datasets, to delay the 'mount all' operation.
I have about 300 mounts with NFS sharing enabled on datasets.

I tried to create > 1100 nested datasets with sharenfs=on, then I rebooted the machine and lockd started without any problem.

I have about 300 datasets on different zpools:
- rpool with boot datasets
- data3 zpool with combined datasets with compression/dedup enabled, zones, and a build env without compression/dedup
- data6 with zones, Xen VMs, and an archive with dedup
I have sharenfs + sharesmb enabled on datasets for access from zones and from a Mac.
I see issues with lockd on every system reboot.

#11

Updated by Marcel Telka about 5 years ago

Mr. Jester or Igor, please contact me privately (via email or IRC) to arrange access to your system. I'll need root access and I'll reboot the machine several times to see the problem manifested. Thanks.

#12

Updated by Marcel Telka about 5 years ago

  • Status changed from Feedback to In Progress

With help from Mr. Jester I'm now able to reproduce the issue on my machine. Thanks.

#13

Updated by J Y about 5 years ago

What are the core repro steps? I'm having the same issue on SmartOS from July. It's annoying to have to perform manual intervention on reboots.

Marcel Telka wrote:

With help from Mr. Jester I'm now able to reproduce the issue on my machine. Thanks.

#14

Updated by Marcel Telka about 5 years ago

To reproduce the issue you need a slow statd. To achieve that you can use the following patch:

diff --git a/usr/src/cmd/fs.d/nfs/statd/sm_proc.c b/usr/src/cmd/fs.d/nfs/statd/s
index 22592eb..813f253 100644
--- a/usr/src/cmd/fs.d/nfs/statd/sm_proc.c
+++ b/usr/src/cmd/fs.d/nfs/statd/sm_proc.c
@@ -1135,6 +1135,8 @@ finish:
                free(lifc->lifc_buf);
                free(lifc);
        }
+
+       sleep(300);
 }

 /*

Then build the statd binary. Disable the nfs/status and nfs/nlockmgr services and run the following commands:

./statd
/usr/lib/nfs/lockd

Once lockd goes to the background (you need to wait for a while) you will see the following in /var/adm/messages:

Sep  2 18:26:01 openindiana klmmod: [ID 432453 kern.notice] NOTICE: Failed to initialize NSM handler (error=2)
Sep  2 18:26:01 openindiana /usr/lib/nfs/lockd[883]: [ID 491006 daemon.error] Cannot establish NLM service over <file desc. 9, protocol udp> : No such file or directory. Exiting
#15

Updated by Mr. Jester about 5 years ago

What are potential causes of the slow statd? Is there a fix in the works?

#16

Updated by Jeffrey Vandenbroucke almost 5 years ago

Chiming in. Same issue here running DilOS (illumos) on a VM with an LSI 9211-8i in DirectIO (passthru).
Specs and details of the config on ZeroBin: http://paste.ec/?39c11a5337a8f995#ARCJIfcHRoi12OathbO6F4zMnhlcojer1If9uQ6HvnM=

#17

Updated by Dan Vatca almost 5 years ago

I managed to reproduce the issue here. Statd is trying to recover after a crash while it has this link in /var/statmon/sm.bak:

root@danbuild:/var/statmon/sm.bak# ls -la /var/statmon/sm.bak/
total 2
drwxr-xr-x 2 daemon daemon  3 Dec  3 12:33 .
drwxr-xr-x 4 daemon daemon  5 Dec  3 12:24 ..
lrwxrwxrwx 1 daemon daemon  18 Dec  3 12:33 ipv4.192.168.1.100 -> dan.dev.syneto.net

The issue is that the IP 192.168.1.100 is not responding (in this case, my client had a different IP address while statd was starting).
The workaround is to remove the stale link and restart statd:
root@danbuild:/var/statmon/sm.bak# rm -f /var/statmon/sm.bak/ipv4.192.168.1.100
root@danbuild:/var/statmon/sm.bak# svcadm restart nfs/status
root@danbuild:/var/statmon/sm.bak# svcadm clear nlockmgr
root@danbuild:/var/statmon/sm.bak# svcs nlockmgr
STATE          STIME    FMRI
online         12:40:55 svc:/network/nfs/nlockmgr:default

It seems that statd's crash recovery process is slow because hosts that were connected before the crash were no longer available while it was trying to recover.

#18

Updated by John Klimek almost 5 years ago

Dan, thanks for your information. I had the same problem and removing some stale (?) files in the sm.bak directory completely fixed my problem.

#19

Updated by Mr. Jester almost 5 years ago

I did not have these files on my system the last time I ran into the problem.

#20

Updated by Paul Orlando over 4 years ago

I have the same issue. I upgraded the OS (SmartOS) to the latest build (20141212T011737Z), and suddenly none of my NFS clients can connect to the server because nlockmgr (and nfs/server) no longer starts:

2014-12-25T12:26:18.617042+00:00 mn01 klmmod: [ID 814159 kern.notice] NOTICE: Failed to connect to local statd (rpcerr=5)#012
2014-12-25T12:26:18+00:00 mn01 /usr/lib/nfs/lockd[2377]: [ID 491006 daemon.error] Cannot establish NLM service over <file desc. 9, protocol udp> : I/O error. Exiting
2014-12-25T12:26:18+00:00 mn01 svc.startd[8]: [ID 652011 daemon.warning] svc:/network/nfs/nlockmgr:default: Method "/lib/svc/method/nlockmgr" failed with exit status 1.

I deleted the file Dan mentioned earlier and rebooted the machine. NFS came up without any issues. Deleting /var/statmon/sm.bak/ipv4.192.168.1.50 did the trick.

However, today I was messing around with networking and locked myself out, unable to ssh into the machine. I connected a monitor and keyboard to the server to solve the problem, and I also disconnected the network cables from the machine during troubleshooting. After reboot, the nlockmgr service does not start, again. The error is now slightly different:

2014-12-29T17:08:37.058039+00:00 mn01 genunix: [ID 432453 kern.notice] NOTICE: Failed to initialize NSM handler (error=2)#012
2014-12-29T17:08:37+00:00 mn01 /usr/lib/nfs/lockd[2540]: [ID 491006 daemon.error] Cannot establish NLM service over <file desc. 9, protocol udp> : No such file or directory. Exiting
2014-12-29T17:08:37+00:00 mn01 svc.startd[8]: [ID 652011 daemon.warning] svc:/network/nfs/nlockmgr:default: Method "/lib/svc/method/nlockmgr" failed with exit status 1.

This time no files can be found in /var/statmon/sm.bak. Restarting nfs/status and clearing nlockmgr had no effect.

#21

Updated by Paul Dagnelie over 4 years ago

We've been looking at these issues at Delphix, and have discovered that there are two similar-looking issues. The first is the one already described, wherein hosts that are no longer accessible but still have files in the sm and sm.bak directories cause statd to hang long enough on startup that nlockmgr times out. One way to fix that might be to rearchitect statd so that nlockmgr can safely move past sm_simu_crash_svc before statd has finished issuing SM_NOTIFY messages to all the hosts that it has records for. This presents with "NOTICE: Failed to connect to local statd (rpcerr=5)".

The other issue appears to be related to DNS (or at least that's the easiest way to reproduce it). If the DNS server is unreachable on boot (or any time name-service-cache, statd, and nlockmgr all restart at the same time), then statd takes long enough to finish startup that nlockmgr times out. This one presents with "NOTICE: Failed to initialize NSM handler (error=2)".

#22

Updated by Marcel Telka over 4 years ago

  • Subject changed from /usr/lib/nfs/lockd: [daemon.error] Cannot establish NLM service over <file desc. 9, protocol udp> : I/O error. Exiting to lockd: Cannot establish NLM service over <file desc. 9, protocol udp> : I/O error. Exiting
#23

Updated by Marcel Telka over 4 years ago

statd startup

During startup, statd (in the main() function in sm_svc.c) does the following:

  1. Initial setup (up to line 552). Here the options are parsed and some other initial setup is done.
  2. Fork (lines 554 to 570).
  3. Daemon lock setup (lines 572 to 589).
  4. Merges (lines 591 to 595).
  5. RPC setup (lines 597 to 625). Here the statd registers itself with rpcbind.
  6. Some filesystem work (lines 627 to 653).
  7. Globals initialization (lines 655 to 661).
  8. Crash (lines 662 to 669).
  9. RPC service start (from line 671).

All of the above should be fast and straightforward, with the exception of the Merges (lines 591 to 595) and, partially, the Crash (lines 662 to 669). Both of these areas could cause problems with the lockd start.

After the Fork (lines 554 to 570), smf considers the svc:/network/nfs/status:default service as properly started, so it will start svc:/network/nfs/nlockmgr:default now (in the case we are starting both nfs/status and nfs/nlockmgr, which is usual).

lockd startup

When lockd (the nfs/nlockmgr service) starts, nlm_svc_starting() is called and two statd-related tasks are done. The first is obtaining the address of statd as registered with rpcbind. This is done in nlm_nsm_init(). If this step fails (after several retries), the ENOENT error is returned and we see this error in the messages:

Sep  2 18:26:01 openindiana /usr/lib/nfs/lockd[883]: [ID 491006 daemon.error] Cannot establish NLM service over <file desc. 9, protocol udp> : No such file or directory. Exiting

If nlm_nsm_init() succeeds (called from nlm_nsm_init_local()), nlm_svc_starting() proceeds with the second statd-related task: it simulates an NSM crash by calling nlm_nsm_simu_crash(). If nlm_nsm_simu_crash() fails, the EIO error is returned and we see the following in the messages:

Jan 18 16:31:55 batfs9920 /usr/lib/nfs/lockd[3527]: [ID 491006 daemon.error] Cannot establish NLM service over <file desc. 9, protocol udp> : I/O error. Exiting

The ENOENT error

If statd is slow in the Merges (lines 591 to 595), lockd won't be able to obtain the registered address of the RPC service from rpcbind, so nlm_nsm_init() will fail with ENOENT. Remember, the Merges (lines 591 to 595) are done after the Fork (lines 554 to 570) - so lockd has just started - but before the RPC setup (lines 597 to 625), so statd is not yet registered with rpcbind.

The EIO error

During the Crash (lines 662 to 669), statd first sets in_crash to 1 and then calls statd_init(). The statd_init() does some filesystem tasks and finally creates a new thread - thr_statd_init(). All of this should be fast; the only slow part here could be the thr_statd_init() thread itself.

The thr_statd_init() thread notifies the statd on the other machines using worker threads (up to 25 of them - MAX_THR). Once the notification is finished, in_crash is cleared (line 429 in sm_statd.c) and the crash_finish condition variable is signaled.

The in_crash flag (together with the crash_finish condition variable) serializes sm_simu_crash_svc(), so only one thread can do the actual crash and notification of the other statds. If in_crash is set to 1, the other threads just wait at the beginning of sm_simu_crash_svc() until the notification in progress is finished. If that notification is slow, sm_simu_crash_svc() will wait for a very long time, the nlm_nsm_simu_crash() call will simply time out and return, EIO is returned, and lockd will fail.
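The serialization described above can be modeled with a minimal sketch. The names (in_crash, crash_lock, crash_finish) follow sm_proc.c, but the functions here are illustrative stand-ins, not the real statd code; the notification work itself is elided:

```c
/* Minimal model of the sm_simu_crash_svc() serialization described
 * above.  Names follow sm_proc.c; the functions are stand-ins. */
#include <pthread.h>

static pthread_mutex_t crash_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t crash_finish = PTHREAD_COND_INITIALIZER;
static int in_crash = 0;

/* Called by the notification thread once all peers were notified
 * (the counterpart of line 429 in sm_statd.c). */
void
notify_done(void)
{
	(void) pthread_mutex_lock(&crash_lock);
	in_crash = 0;
	(void) pthread_cond_broadcast(&crash_finish);
	(void) pthread_mutex_unlock(&crash_lock);
}

/* Returns 1 if this caller owns the crash and must start the
 * notification work; 0 if it merely waited for a crash already in
 * progress -- the waiting that makes nlm_nsm_simu_crash() time out. */
int
simu_crash_enter(void)
{
	(void) pthread_mutex_lock(&crash_lock);
	if (in_crash) {
		(void) pthread_cond_wait(&crash_finish, &crash_lock);
		(void) pthread_mutex_unlock(&crash_lock);
		return (0);
	}
	in_crash = 1;
	(void) pthread_mutex_unlock(&crash_lock);
	return (1);
}
```

The key point is visible in simu_crash_enter(): every caller after the first blocks for as long as the notification work takes, which can exceed the 25-second RPC timeout on the lockd side.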

#24

Updated by Marcel Telka over 4 years ago

  • Subject changed from lockd: Cannot establish NLM service over <file desc. 9, protocol udp> : I/O error. Exiting to lockd: Cannot establish NLM service over <file desc. 9, protocol udp>
#25

Updated by Marcel Telka over 4 years ago

The EIO error - timeouts

The notification worker threads in statd (the thr_call_statd() function) notify the other party by calling statd_call_statd(). If the remote machine is not reachable at all (for any reason), statd_call_statd() will return only after more than 450 seconds. You can confirm that by running this:

# svcadm disable nfs/status
# ln -s test /var/statmon/sm.bak/ipv4.10.0.0.92
# dtrace -n 'pid$target::statd_call_statd:entry{self->t=timestamp} pid$target::statd_call_statd:return {trace((timestamp-self->t)/1000/1000/1000)}' -c "/usr/lib/nfs/statd -d 1" 
dtrace: description 'pid$target::statd_call_statd:entry' matched 2 probes
debug is on, create entry: /var/statmon/sm, /var/statmon/sm.bak, /var/statmon/state
openindiana openindiana.local localhost loghost openindiana.local localhost loghost 10.0.2.15 0.0.0.0 fe80::a00:27ff:fe6f:fe92 fe80::a00:27ff:fe26:12c3 Setting owner for /var/statmon
Initializing hash tables
enter statd_init
local state = 35
Starting svc_run
statd_call_statd at ipv4.10.0.0.92
statd_call_statd: calling create_client(10.0.0.92)
statd call failed, inserting ipv4.10.0.0.92 in recov_q
insert_name: inserted ipv4.10.0.0.92 at 806ae18
*****recovery_q: (806d238), 
Creating thread for sm_try
statd_call_statd at ipv4.10.0.0.92
statd_call_statd: calling create_client(10.0.0.92)
CPU     ID                    FUNCTION:NAME
  0  64984          statd_call_statd:return               474
  0  64984          statd_call_statd:return               474
statd_call_statd at ipv4.10.0.0.92
statd_call_statd: calling create_client(10.0.0.92)
^C

#

OTOH, the timeout in nlm_nsm_simu_crash() is only 25 seconds:

# dtrace -n 'nlm_nsm_simu_crash:entry{self->t=timestamp} nlm_nsm_simu_crash:return{trace((timestamp-self->t)/1000/1000)}' -c "/usr/lib/nfs/lockd -d 1" 
dtrace: description 'nlm_nsm_simu_crash:entry' matched 2 probes
/usr/lib/nfs/lockd: debug= 1, conn_idle_timout= 300, grace_period= 90, listen_backlog= 32, max_connections= -1, max_servers= 20, retrans_timeout= 5
dtrace: pid 5176 exited with status 1
CPU     ID                    FUNCTION:NAME
  0  61965        nlm_nsm_simu_crash:return             24996

#

So if we have at least one unreachable remote host, we will see the EIO error on lockd startup.

The EIO error - the fix

Obviously, the direct root cause for the EIO error is this waiting (line 195) in sm_simu_crash_svc():

183 void
184 sm_simu_crash_svc(void *myidp)
185 {
186     int i;
187     struct mon_entry *monitor_q;
188     int found = 0;
189
190     /* Only one crash should be running at a time. */
191     mutex_lock(&crash_lock);
192     if (debug)
193         (void) printf("proc sm_simu_crash\n");
194     if (in_crash) {
195         cond_wait(&crash_finish, &crash_lock);
196         mutex_unlock(&crash_lock);
197         return;
198     } else {
199         in_crash = 1;
200     }
201     mutex_unlock(&crash_lock);

To fix the EIO problem we should just remove line 195, so we won't wait for the currently running notification threads to finish.

Why could (and should) we do that? There are several reasons:

  1. The first (and obvious) reason is to fix the EIO error. If statd doesn't wait in sm_simu_crash_svc(), nlm_nsm_simu_crash() won't time out, so nlm_svc_starting() won't fail.
  2. The lockd cannot wait for statd to finish the notification of all remote machines. Even if statd is not as slow with the notifications as we saw above (more than 450 seconds), but faster, say 20 seconds, lockd should be prepared immediately for incoming lock reclaims from the remote hosts. With the current waiting in sm_simu_crash_svc(), lockd starts to accept lock reclaims only after all remote machines are notified. If some remote machine has a short timeout (less than 20 seconds), it will fail to reclaim its locks just because our lockd is not yet prepared for lock reclaims, since it is still waiting for statd to finish all notifications.
  3. The waiting in sm_simu_crash_svc() is not fair. The first caller of sm_simu_crash_svc() just sets in_crash to 1, does some other (fast) work, then calls sm_crash(), does some (fast) work in sm_crash(), and calls statd_init(). The statd_init() is fast as well, and at its end the thr_statd_init() thread is spawned. So the first caller returns fast, without actually waiting for the notifications to finish. This is not true for subsequent sm_simu_crash_svc() callers: all of them wait until all notifications are finished. This is unfair, and there is no reason for it.
  4. The waiting on the crash_finish condition variable is not reliable. The cond_wait() can return even when the crash_finish variable was not signaled. We should re-check the in_crash value once cond_wait() returns, but we don't do so.
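Point 4 refers to spurious wakeups: a condition-variable wait is only reliable when the predicate is re-checked in a loop. A minimal illustration of the correct pattern, using the sm_proc.c names but not the actual statd code:

```c
/* The reliable form of the wait criticized in point 4 above:
 * re-check the predicate in a loop, because pthread_cond_wait()
 * may return spuriously, without the condition being signaled. */
#include <pthread.h>

static pthread_mutex_t crash_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t crash_finish = PTHREAD_COND_INITIALIZER;
static int in_crash = 0;

void
wait_for_crash_done(void)
{
	(void) pthread_mutex_lock(&crash_lock);
	while (in_crash)	/* loop, not a single cond_wait() */
		(void) pthread_cond_wait(&crash_finish, &crash_lock);
	(void) pthread_mutex_unlock(&crash_lock);
}
```

With the single cond_wait() in sm_simu_crash_svc(), a spurious return lets the caller proceed while the notification work is still in progress.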
#26

Updated by Gordon Ross over 4 years ago

Thanks for the thorough analysis. I have some design change recommendations:

statd, nlm startup sequence:

Regarding the first problem you described:

After the Fork (lines 554 to 570), smf considers the svc:/network/nfs/status:default service as properly started, so it will start svc:/network/nfs/nlockmgr:default now (in the case we are starting both nfs/status and nfs/nlockmgr, which is usual).

This sounds like statd is breaking one of the rules governing its relationship with SMF, which is that a new service should not tell SMF that the service has "started" until it is truly ready to begin operations. In this case, that means it should have its RPC service listener endpoints bound and usable before the SMF start method process exits.

Many SMF services use a special handshake between a child and grandchild process to implement all this startup sequencing in a way that tells SMF it's started only when it's actually ready for service. One example I happen to be familiar with is the function daemonize_init(), found here
http://src.illumos.org/source/xref/illumos-gate/usr/src/cmd/fs.d/nfs/lib/daemon.c#55

It looks like statd is not using the daemonize_init()/daemonize_fini() mechanism. I recommend adding it, making sure statd calls daemonize_fini() only after the RPC service listener is running. That way, SMF will sequence the NLM start to begin only after this service is truly ready.
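The handshake works roughly like this conceptual sketch, loosely modeled on daemonize_init()/daemonize_fini() in daemon.c but heavily simplified (the real code also handles setsid(), fd redirection, privileges, and error reporting):

```c
/* Simplified sketch of the parent/child readiness handshake: the
 * parent (the SMF start method process) blocks on a pipe and exits
 * only after the child daemon reports readiness, so SMF sees the
 * start method finish only once listeners are actually bound. */
#include <sys/types.h>
#include <stdlib.h>
#include <unistd.h>

static int pipe_fd = -1;	/* write end, held by the daemon child */

/* Fork; the parent blocks until the child reports readiness. */
pid_t
daemonize_init(void)
{
	int fds[2];
	pid_t pid;
	char status = 1;

	if (pipe(fds) != 0)
		exit(1);
	if ((pid = fork()) == -1)
		exit(1);
	if (pid > 0) {
		/* Parent: wait for the child's status byte, then exit
		 * with it.  This is the moment SMF sees the start
		 * method complete. */
		(void) close(fds[1]);
		if (read(fds[0], &status, 1) != 1)
			status = 1;
		exit(status);
	}
	/* Child: keep the write end; report later via daemonize_fini(). */
	(void) close(fds[0]);
	pipe_fd = fds[1];
	return (pid);
}

/* Called only after the RPC listeners are bound: release the parent. */
void
daemonize_fini(void)
{
	char status = 0;

	(void) write(pipe_fd, &status, 1);
	(void) close(pipe_fd);
	pipe_fd = -1;
}
```

A statd adopting this pattern would call daemonize_init() early, do its RPC setup, and call daemonize_fini() only once svc_run() is about to start, so nlockmgr's dependency on nfs/status becomes meaningful.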

the EIO error

Where NLM calls statd's sm_simu_crash_svc(), there are a couple of problems, as you describe in detail:
  • RPC calls are not supposed to be long running operations (or else one will get timeouts as observed here).
  • After NLM starts up and tells statd to "simulate a crash" (asking statd to notify all former peers that NLM restarted),
    the NLM needs to remain in "grace" mode (no new locks, reclaim only) until statd finishes notifying all our peers. Technically, the "grace period" timer should start only after notifications are complete.

Given those constraints, I suggest using a pair of one-way calls here. (Of course, these are only conceptually one-way calls, because RPC always has a return, but one-way calls ignore the returns.) This would start with NLM using SM_SIMU_CRASH to ask statd to begin a "simulated crash" (notify all our peers of a restart). The SM_SIMU_CRASH call would return immediately, and then NLM would enter "grace" mode, waiting for a notification to come back from statd. When statd is finished sending notifications, it would call back to NLM to notify it that the simulated crash notification work is complete. There are already two special RPC calls defined in the NLM protocol (NLM_SM_NOTIFY1, NLM_SM_NOTIFY2) so it should be possible to use the second of these for this "reply" one-way call (the first is already in use, the second is not).

#27

Updated by Gordon Ross over 4 years ago

Some other observations:

If we go with the one-way calls design I suggested in the previous comment, sm_proc.c:sm_simu_crash_svc() will need to work asynchronously so it can return immediately. Here I would suggest a single "simu_crash" worker thread that may or may not exist, with its thread ID stored in a global (simu_crash_tid?) consulted by sm_simu_crash_svc(). If the simu_crash thread is already running (simu_crash_tid is set), just return; otherwise start the simu_crash thread and save its TID in simu_crash_tid (with appropriate mutex locks etc.). The variable "in_crash" could either be replaced by tests of simu_crash_tid, or the simu_crash thread could set in_crash while it runs and clear it at the end. Lastly, the new simu_crash thread should call NLM_SM_NOTIFY2 when notifications have all been sent.
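The suggested arrangement might look like the following sketch. Every name here (sm_simu_crash_svc_async, simu_crash_tid, simu_crash_running) is illustrative, taken from or invented around the proposal above, not existing statd code, and the notification and NLM_SM_NOTIFY2 work is left as comments:

```c
/* Sketch of the proposed single "simu_crash" worker thread: the
 * service routine starts the worker if needed and returns at once. */
#include <pthread.h>

static pthread_mutex_t crash_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_t simu_crash_tid;
static int simu_crash_running = 0;

static void *
simu_crash_thread(void *arg)
{
	(void) arg;
	/* ... send SM_NOTIFY to all recorded hosts ... */
	/* ... then issue the one-way NLM_SM_NOTIFY2 "done" call ... */
	(void) pthread_mutex_lock(&crash_lock);
	simu_crash_running = 0;
	(void) pthread_mutex_unlock(&crash_lock);
	return (NULL);
}

/* Asynchronous sm_simu_crash_svc(): start the worker if it is not
 * already running, then return immediately so the RPC call is fast. */
void
sm_simu_crash_svc_async(void)
{
	(void) pthread_mutex_lock(&crash_lock);
	if (!simu_crash_running) {
		simu_crash_running = 1;
		(void) pthread_create(&simu_crash_tid, NULL,
		    simu_crash_thread, NULL);
	}
	(void) pthread_mutex_unlock(&crash_lock);
}
```

This keeps the RPC handler short-lived, which addresses the first bullet above, while the NLM_SM_NOTIFY2 callback would address the grace-period sequencing.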

We could also just do without the NLM_SM_NOTIFY2 call and let statd do its notification work entirely asynchronously from the NLM. At worst, NLM might finish its grace period (default = 90 sec.) before statd manages to send all the notifications. I'm not sure that would be worse than having NLM delay its transition out of grace mode until statd finishes the simu_crash work. It would also simplify things, of course. If we go this route, it might be good to have statd log a warning if simu_crash ever takes longer than grace_period. Maybe that's enough.

Other things I noticed:

The threads doing the actual notifications should use shorter timeouts than the defaults. If we're trying to notify a peer that it should reclaim locks and it's not responding to RPC in a reasonable amount of time, we really should just move along and notify our other hosts.
It would also be great if we could do all of these notifications in parallel, but that would probably mean more threads...

A taskq with a limited number of worker threads would be a lot easier than what sm_statd.c:thr_statd_init() is doing.
This code pre-dates task queues. However, that's probably another bug for another day.

#28

Updated by Marcel Telka over 4 years ago

Gordon Ross wrote:

Thanks for the thorough analysis. I have some design change recommendations:

statd, nlm startup sequence:

Regarding the first problem you described:

After the Fork (lines 554 to 570), smf considers the svc:/network/nfs/status:default service as properly started, so it will start svc:/network/nfs/nlockmgr:default now (in the case we are starting both nfs/status and nfs/nlockmgr, which is usual).

This sounds like statd is breaking one of the rules governing its relationship with SMF, which is that a new service should not tell SMF that the service has "started" until it is truly ready to begin operations. In this case, that means it should have its RPC service listener endpoints bound and usable before the SMF start method process exits.

Many SMF services use a special handshake between a child and grandchild process to implement all this startup sequencing in a way that tells SMF it's started only when it's actually ready for service. One example I happen to be familiar with is the function daemonize_init(), found here
http://src.illumos.org/source/xref/illumos-gate/usr/src/cmd/fs.d/nfs/lib/daemon.c#55

It looks like statd is not using the daemonize_init()/daemonize_fini() mechanism. I recommend adding it, making sure statd calls daemonize_fini() only after the RPC service listener is running. That way, SMF will sequence the NLM start to begin only after this service is truly ready.

Yes, exactly. This premature statd(1m) start is partially responsible for the ENOENT error. The daemonize_init()/daemonize_fini() mechanism is used by the nfsd, lockd, and mountd daemons, so we should do the same here with statd.

Unfortunately, this won't be enough to avoid the statd/lockd startup issues. With the daemonize_init()/daemonize_fini() mechanism, the parent statd process would wait until the child statd process finishes the whole start sequence from the Fork to the Crash (inclusive). Everything in the start sequence is fast, except the Merges (lines 591 to 595). The Merges could easily last far more than 60 seconds (the start timeout for statd defined in the smf manifest), so the parent statd process would need to wait that long as well, and smf would consider the statd startup as failing. In that case we would just push the lockd startup problem to statd, which is ... not very good.

The ENOENT error - timeouts

The Merges (lines 591 to 595) in statd can be really slow. To see how slow, you can do the following:

First, replace the DNS server address in /etc/resolv.conf with an unreachable address, or simply shut down your DNS server. Then do this:

# svcadm disable svc:/network/nfs/status:default
# svcadm restart svc:/system/name-service-cache:default
# dtrace -n 'pid$target::merge_hosts:entry{self->t=timestamp} pid$target::merge_hosts:return{trace((timestamp-self->t)/1000/1000/1000)}' -c "/usr/lib/nfs/statd -d 1" 
dtrace: description 'pid$target::merge_hosts:entry' matched 2 probes
debug is on, create entry: /var/statmon/sm, /var/statmon/sm.bak, /var/statmon/state
CPU     ID                    FUNCTION:NAME
  0  64984               merge_hosts:return               180
openindiana openindiana.local localhost loghost openindiana.local localhost loghost 10.0.2.15 0.0.0.0 fe80::a00:27ff:fe6f:fe92 fe80::a00:27ff:fe26:12c3 Setting owner for /var/statmon
Initializing hash tables
enter statd_init
local state = 73
Starting svc_run
Creating thread for sm_try
EXITING sm_try
^C

#

We see that statd spent exactly 3 minutes (180 seconds) in merge_hosts().

The ENOENT error - the fix

The fix will be twofold. The first part will add the daemonize_init()/daemonize_fini() mechanism (as described by Gordon above). This will make sure that smf won't start lockd too early, before statd is prepared for the actual work.

The second part of the fix needs to somehow eliminate the slow Merges from the main code path. As a result of the Merges, the host_name global variable is populated (via various add_to_host_array() calls from both the merge_hosts() and merge_ips() functions). Who uses host_name, and how?

The host_name global variable is used by one consumer only: the statd_call_statd() function. The statd_call_statd() function is called from two other functions, thr_call_statd() and sm_try(). Both of these are used as thread start functions for threads spawned asynchronously during the crash simulation.

So, it should be possible to move the Merges from statd's main() to a separate thread and make statd_call_statd() just wait until the Merges thread finishes populating the host_name global variable. This will obviously:

  1. Solve the slow statd startup, because the only slow part (the Merges) will run asynchronously.
  2. Make the crash notification of other statds even slower than it is now, because the notification worker threads will need to wait until the asynchronous Merges are done. This is a real problem, but not a critical one, since the worker threads might be slow even without waiting for the Merges (as we saw in the EIO error case).
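The waiting described above could be sketched with a condition variable. This is a minimal sketch with assumed names (merges_done, wait_for_merges()); the real merge_hosts()/merge_ips() calls and host_name handling live in sm_statd.c:

```c
/* Sketch: run the slow Merges in a dedicated thread and let the
 * notification workers block on a condition variable until the
 * host_name population is complete. */
#include <pthread.h>
#include <unistd.h>

static pthread_mutex_t	merges_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t	merges_cv = PTHREAD_COND_INITIALIZER;
static int		merges_done = 0;

static void *
merges_thread(void *arg)
{
	(void) arg;
	/* stand-in for merge_hosts()/merge_ips(); may block on DNS */
	usleep(10000);
	(void) pthread_mutex_lock(&merges_lock);
	merges_done = 1;
	(void) pthread_cond_broadcast(&merges_cv);
	(void) pthread_mutex_unlock(&merges_lock);
	return (NULL);
}

/* Called by the notification workers before they touch host_name. */
static void
wait_for_merges(void)
{
	(void) pthread_mutex_lock(&merges_lock);
	while (!merges_done)
		(void) pthread_cond_wait(&merges_cv, &merges_lock);
	(void) pthread_mutex_unlock(&merges_lock);
}

int
demo_async_merges(void)
{
	pthread_t tid;

	/* main() spawns this and returns quickly */
	(void) pthread_create(&tid, NULL, merges_thread, NULL);
	wait_for_merges();	/* a worker blocks only when it needs host_name */
	(void) pthread_join(tid, NULL);
	return (merges_done);
}
```

The key property is that only the threads that actually consume host_name pay the cost of a slow DNS; statd's main() reaches the "ready" point without waiting.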
#29

Updated by Gordon Ross over 4 years ago

When "Merges" takes a long time, that's a DNS problem. I'm not sure I like the idea of trying to work around that in statd this way.
I talked with Marcel about this. As long as nothing needs the interface names and addresses until later, the "Merges" work can happen asynchronously after the synchronous parts of initialization finish and tell SMF that the service is "ready". The caution (and the complexity I'm not thrilled with) is that a simu_crash (and maybe other things) that may run shortly after startup will need a way to synchronize, waiting for the async parts of initialization to complete.

#30

Updated by Marcel Telka over 4 years ago

Gordon Ross wrote:

the EIO error

Where NLM calls the statd sm_simu_crash_svc(), I see a couple of problems here.
  • RPC calls are not supposed to be long running operations (or else one will get timeouts as observed here).

Yes, exactly.

The sm_simu_crash_svc() function is usually fast. It is slow only in the case where sm_simu_crash_svc() is called before the previous notification task (implemented in thr_statd_init() and its workers in thr_call_statd()) has completed.

  • After NLM starts up and tells statd to "simulate a crash" (asking statd to notify all former peers that NLM restarted),
    the NLM needs to remain in "grace" mode (no new locks, reclaim only) until statd finishes notifying all our peers. Technically, the "grace period" timer should start only after notifications are complete.

Given those constraints, I suggest using a pair of one-way calls here. (Of course, these are only conceptually one-way calls, because RPC always has a return, but one-way calls ignore the returns.) This would start with NLM using SM_SIMU_CRASH to ask statd to begin a "simulated crash" (notify all our peers of a restart). The SM_SIMU_CRASH call would return immediately, and then NLM would enter "grace" mode, waiting for a notification to come back from statd. When statd is finished sending notifications, it would call back to NLM to notify it that the simulated crash notification work is complete. There are already two special RPC calls defined in the NLM protocol (NLM_SM_NOTIFY1, NLM_SM_NOTIFY2) so it should be possible to use the second of these for this "reply" one-way call (the first is already in use, the second is not).

Yes, this seems to be a problem, but it is not related to the lockd startup failure (this bug) and it should be fixed as a separate bug.
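For illustration, the call pairing Gordon proposes could be sketched as follows. This is a single-process simulation of the control flow, not real RPC; the function names mirror the protocol procedures, and the wiring (the callback pointer) is invented for the sketch:

```c
/* Simulation of the proposed one-way call pairing: SM_SIMU_CRASH
 * returns immediately, NLM sits in grace mode, and statd later "calls
 * back" with NLM_SM_NOTIFY2 to end the grace period. */

static int nlm_in_grace = 0;

/* NLM side: callback that ends the grace period. */
static void
nlm_sm_notify2(void)
{
	nlm_in_grace = 0;	/* notifications done: grant new locks again */
}

/* statd side: notify all former peers, then report completion to NLM. */
static void
statd_simu_crash(void (*notify_done)(void))
{
	/* ... walk /var/statmon/sm.bak and send SM_NOTIFY to each peer ... */
	notify_done();		/* conceptually the NLM_SM_NOTIFY2 one-way call */
}

int
demo_grace_handshake(void)
{
	nlm_in_grace = 1;			/* NLM enters grace (reclaim only) */
	statd_simu_crash(nlm_sm_notify2);	/* SM_SIMU_CRASH returns at once */
	return (nlm_in_grace);			/* 0 once grace has ended */
}
```

In the real protocol both directions would be RPC calls whose return values are ignored, so neither side blocks on the other's slow work.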

#31

Updated by Marcel Telka over 4 years ago

Gordon Ross wrote:

Some other observations:

If we go with the one-way calls design I suggested in the previous comment, sm_proc.c:sm_simu_crash_svc() will need to work asynchronously so it can return immediately. Here I would suggest a single "simu_crash" worker thread that may or may not exist, with its thread ID stored in a global (simu_crash_tid?) consulted by sm_simu_crash_svc(). If the simu_crash thread is already running (simu_crash_tid is set), just return; otherwise start the simu_crash thread and save its TID in simu_crash_tid (with appropriate mutex locks etc.). The variable "in_crash" could either be replaced by tests of simu_crash_tid, or the simu_crash thread could set in_crash while it runs and clear it at the end. Lastly, the new simu_crash thread should call NLM_SM_NOTIFY2 when all the notifications have been sent.
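That dedupe-and-return shape could be sketched as below. Variable names follow the suggestion above (simu_crash_tid); everything else is an assumed stand-in, not the actual sm_proc.c code:

```c
/* Sketch: sm_simu_crash_svc() returns immediately and collapses
 * concurrent requests into a single worker thread. */
#include <pthread.h>

static pthread_mutex_t	crash_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_t	simu_crash_tid;
static int		simu_crash_running = 0;
static int		crash_runs = 0;	/* demo counter: workers completed */

static void *
simu_crash_worker(void *arg)
{
	(void) arg;
	/* ... send SM_NOTIFY to every recorded peer, then NLM_SM_NOTIFY2 ... */
	(void) pthread_mutex_lock(&crash_lock);
	crash_runs++;
	simu_crash_running = 0;		/* allow a future simulated crash */
	(void) pthread_mutex_unlock(&crash_lock);
	return (NULL);
}

/* The RPC handler: never blocks on the notification work itself. */
static void
sm_simu_crash_svc_sketch(void)
{
	(void) pthread_mutex_lock(&crash_lock);
	if (simu_crash_running) {	/* a worker already exists: just return */
		(void) pthread_mutex_unlock(&crash_lock);
		return;
	}
	simu_crash_running = 1;
	(void) pthread_create(&simu_crash_tid, NULL, simu_crash_worker, NULL);
	(void) pthread_mutex_unlock(&crash_lock);
}

int
demo_simu_crash(void)
{
	sm_simu_crash_svc_sketch();
	(void) pthread_join(simu_crash_tid, NULL);
	return (crash_runs);
}
```

Whether the dedupe flag is a separate boolean (as here) or a test of simu_crash_tid itself is the same design choice discussed above.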

We could also just do without the NLM_SM_NOTIFY2 call and let statd do its notification work entirely asynchronously from the NLM. At worst, NLM might finish its grace period (default = 90 sec.) before statd manages to send all the notifications. I'm not sure that would be worse than having NLM delay its transition out of grace mode until statd finishes the simu_crash work. It would also simplify things, of course. If we go this route, it might be good to have statd log a warning if simu_crash ever takes longer than grace_period. Maybe that's enough.

This describes how it will work once I implement the EIO error fix as I proposed above. The only difference is that I won't use the end notification (you proposed NLM_SM_NOTIFY2), and statd won't log anything in case the notification takes longer than the grace_period (that is another issue).

Other things I noticed:

The threads doing the actual notifications should use shorter timeouts than the defaults. If we're trying to notify a peer that it should reclaim locks and it's not responding to RPC in a reasonable amount of time, we really should just move along and notify our other hosts.
It also would be great if we could do all of these notifications in parallel, but that would probably mean more threads...

A taskq with a limited number of worker threads would be a lot easier than what sm_statd.c:thr_statd_init() is doing.
This code pre-dated task queues. However, that's probably another bug for another day.

Yes, this is another bug.

#32

Updated by Marcel Telka over 4 years ago

Testing of the fix

Note: 10.0.0.99 is the IP address of a non-existent host.

Without the fix:

# svcadm disable nfs/nlockmgr
# svcadm disable nfs/status
# ln -s test /var/statmon/sm.bak/ipv4.10.0.0.99
# svcadm enable nfs/status ; svcadm enable nfs/nlockmgr
# sleep 70
# svcs -xv
svc:/network/nfs/nlockmgr:default (NFS lock manager)
 State: maintenance since Wed Mar 25 16:45:58 2015
Reason: Start method failed repeatedly, last exited with status 1.
   See: http://illumos.org/msg/SMF-8000-KS
   See: man -M /usr/share/man -s 1M lockd
   See: /var/svc/log/network-nfs-nlockmgr:default.log
Impact: This service is not running.
# grep "NLM service" /var/adm/messages | tail -1
Mar 25 16:45:58 openindiana /usr/lib/nfs/lockd[7174]: [ID 491006 daemon.error] Cannot establish NLM service over <file desc. 9, protocol udp> : I/O error. Exiting
#
# svcadm disable nfs/nlockmgr
# svcadm disable nfs/status
# gsed -i -e 's/^nameserver.*$/nameserver 10.0.0.99/' /etc/resolv.conf
# svcadm enable nfs/status ; svcadm enable nfs/nlockmgr
# sleep 70
# svcs -xv
svc:/network/nfs/nlockmgr:default (NFS lock manager)
 State: maintenance since Wed Mar 25 16:55:03 2015
Reason: Start method failed repeatedly, last exited with status 1.
   See: http://illumos.org/msg/SMF-8000-KS
   See: man -M /usr/share/man -s 1M lockd
   See: /var/svc/log/network-nfs-nlockmgr:default.log
Impact: This service is not running.
# grep "NLM service" /var/adm/messages | tail -1
Mar 25 16:55:03 openindiana /usr/lib/nfs/lockd[7233]: [ID 491006 daemon.error] Cannot establish NLM service over <file desc. 9, protocol udp> : No such file or directory. Exiting
#

With the fix:

# svcadm disable nfs/nlockmgr
# svcadm disable nfs/status
# ln -s test /var/statmon/sm.bak/ipv4.10.0.0.99
# svcadm enable nfs/status ; svcadm enable nfs/nlockmgr
# sleep 70
# svcs -xv
# grep "NLM service" /var/adm/messages | tail -1
#
# svcadm disable nfs/nlockmgr
# svcadm disable nfs/status
# gsed -i -e 's/^nameserver.*$/nameserver 10.0.0.99/' /etc/resolv.conf
# svcadm enable nfs/status ; svcadm enable nfs/nlockmgr
# sleep 70
# svcs -xv
# grep "NLM service" /var/adm/messages | tail -1
#
#33

Updated by Marcel Telka over 4 years ago

  • Status changed from In Progress to Pending RTI
#34

Updated by Marcel Telka over 4 years ago

  • Status changed from Pending RTI to Resolved
  • % Done changed from 0 to 100

git commit 98573c1925f3692d1e8ea9eb018cb915fc0becc5

commit 98573c1925f3692d1e8ea9eb018cb915fc0becc5
Author: Marcel Telka <marcel.telka@nexenta.com>
Date:   Fri Mar 27 15:18:01 2015 +0100

    4518 lockd: Cannot establish NLM service over <file desc. 9, protocol udp>
    Reviewed by: Gordon Ross <gordon.ross@nexenta.com>
    Reviewed by: Dan McDonald <danmcd@omniti.com>
    Approved by: Dan McDonald <danmcd@omniti.com>

#35

Updated by Electric Monk over 4 years ago

  • Status changed from Resolved to Closed
#37

Updated by Marcel Telka about 3 years ago

  • Related to Bug #7231: nlockmgr failing to start up during bootup added
