Bug #7231


nlockmgr failing to start up during bootup

Added by Daniel Kimmel over 7 years ago. Updated over 7 years ago.

Status: Closed
Priority: Normal
Assignee:
Category: nfs - NFS server and client
Start date: 2016-07-28
Due date:
% Done: 100%
Estimated time:
Difficulty: Medium
Tags: needs-triage
Gerrit CR:
External Bug:

Description

We originally noticed this as a failure to mount an NFS share, caused by the nlockmgr service not being up. I ran "svcadm clear nlockmgr", and then everything worked fine.


I've narrowed down the list of potential failure points based on the logs. We got the message "NOTICE: Failed to initialize NSM handler (error=2)", so it seems we must have failed nlm_nsm_init_local with ENOENT. There are a few places that can result in ENOENT from there, however.
1) nlm_init_local_knc can return ENOENT if lookupname fails to find "/dev/tcp", which admittedly seems unlikely.
2) nlm_nsm_init returns ENOENT if rpcbind_getaddr fails with anything other than RPC_PROGNOTREGISTERED.
3) nlm_nsm_init also returns ENOENT if we use up all our retries.
It looks like we've either failed to do rpcbind_getaddr for the statd service, or we've waited 50 seconds for statd to register and it hasn't yet.


I've been able to reliably reproduce the errors in the lab with a broken DNS configuration, specifically one where there is nothing at the IP of the nameserver, so traffic sent to it is never responded to.

Interestingly, with non-responsive DNS, I observe /lib/svc/method/nlockmgr exit 1 twice, then exit 0, consistently 00:02:01 after it is enabled. With the nameserver pointing at the Delphix Engine itself, so that queries are immediately rejected, I get an exit 0 after just 15 seconds, with no failures. In both cases, however, nlockmgr does in fact start. But this does suggest that some kind of timeout is involved when it falls into maintenance.


At first, we could reliably reproduce this, and that allowed us to determine the root cause: an unreachable DNS server causes statd to hang for a while on boot doing reverse DNS lookups. The only fix, therefore, is to modify the nlockmgr timeout so that it will patiently wait for statd to finish starting. Unfortunately, around this time, we lost the ability to reproduce the bug; we suspect it takes subtle interactions in service startup to fully trigger it. As a result, we're not sure exactly how long the timeout needs to be in order to have nlockmgr start up successfully. This means that the timeout needs to be a tunable; we can increase the default, but it should remain adjustable in case it has to be raised on a case-by-case basis. That's what the fix does; we add a loop in the service startup script that waits until statd is advertising an rpc address. By increasing the service timeout, we can change how long we're willing to let nlockmgr wait for statd to finish startup.
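For illustration only, the wait loop described above could look roughly like the sketch below. This is not the code from the actual commit. The function names wait_for_probe and statd_registered, the retry count, and the interval are all hypothetical, and the rpcinfo invocation is an assumption about how one might ask rpcbind whether statd (RPC program 100024) is answering.

```shell
#!/bin/sh
# Hypothetical sketch; NOT the code from the actual fix.

# wait_for_probe MAX_TRIES INTERVAL CMD [ARGS...]
# Runs CMD until it succeeds, sleeping INTERVAL seconds after each failure.
# Returns 0 on the first success, 1 once MAX_TRIES attempts have failed.
wait_for_probe() {
    max_tries=$1
    interval=$2
    shift 2
    tries=0
    while [ "$tries" -lt "$max_tries" ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        tries=$((tries + 1))
        sleep "$interval"
    done
    return 1
}

# Assumed probe: succeeds once statd (program 100024, version 1)
# responds over TCP on the local host.
statd_registered() {
    rpcinfo -T tcp localhost 100024 1
}
```

A start method might then call something like `wait_for_probe 60 1 statd_registered || exit 1` before launching lockd. With a loop of this shape, the upper bound on the wait is effectively set by the SMF start method timeout, which is what makes the timeout the tunable.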


Related issues

Related to illumos gate - Bug #4518: lockd: Cannot establish NLM service over <file desc. 9, protocol udp> (Closed, Marcel Telka, 2014-01-23)

Actions #1

Updated by Marcel Telka over 7 years ago

  • Related to Bug #4518: lockd: Cannot establish NLM service over <file desc. 9, protocol udp> added
Actions #2

Updated by Paul Dagnelie over 7 years ago

It's been a while since I worked on this, but as I recall, statd_init (which is called before daemonize_fini()) kicks off a worker thread to run thr_statd_init. That thread creates the threads which attempt to contact every host that was holding a lock when we crashed (thr_call_statd, calling statd_call_statd). Only when thr_statd_init has joined all of its threads does it clear the in_crash and die variables. IIRC, those are the variables which prevent the service from actually doing work while they're set (I don't recall the exact mechanism by which that occurs). We could have the main thread wait until all this work has been done before it calls daemonize_fini, but then we risk hitting the service startup timeout any time we have a lot of hosts to contact. It's possible that a variant on this approach would be the ideal solution, but we lost the ability to reproduce this bug, so we couldn't test any other approaches.

Actions #3

Updated by Electric Monk over 7 years ago

  • Status changed from New to Closed
  • % Done changed from 0 to 100

git commit 7b4e6d92ac8978742d2afae42fe91917af8b8938

commit  7b4e6d92ac8978742d2afae42fe91917af8b8938
Author: Paul Dagnelie <pcd@delphix.com>
Date:   2016-08-29T03:37:37.000Z

    7231 nlockmgr failing to start up during bootup
    Reviewed by: Matt Amdur <matt.amdur@delphix.com>
    Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
    Reviewed by: Kyle Cackett <kyle.cackett@delphix.com>
    Reviewed by: George Wilson <george.wilson@delphix.com>
    Reviewed by: Dan McDonald <danmcd@omniti.com>
    Approved by: Richard Lowe <richlowe@richlowe.net>
