nlockmgr failing to start up during bootup
We originally noticed this as a failure to mount an NFS share, caused by the nlockmgr service not being up. I ran "svcadm clear nlockmgr", and then everything worked fine.
I've narrowed down the list of potential failure points based on the logs. We got the message "NOTICE: Failed to initialize NSM handler (error=2)", so nlm_nsm_init_local must have failed with ENOENT (errno 2). There are a few places that can produce ENOENT from there, however.
1) nlm_init_local_knc can return ENOENT if lookupname fails to find "/dev/tcp", which admittedly seems unlikely.
2) nlm_nsm_init has two possible failure modes, both of which return ENOENT:
   a) rpcbind_getaddr fails with anything aside from RPC_PROGNOTREGISTERED, or
   b) rpcbind_getaddr keeps returning RPC_PROGNOTREGISTERED until we use up all our retries.
So it looks like either rpcbind_getaddr failed outright for the statd service, or we waited 50 seconds for statd to register and it never did.
I've been able to reliably reproduce the errors in the lab with a broken DNS configuration, specifically one where there is nothing at the IP of the nameserver, so traffic sent to it is never responded to.
Interestingly, with non-responsive DNS, I observe /lib/svc/method/nlockmgr exit 1 twice, then exit 0, consistently 00:02:01 after it is enabled. With the nameserver entry pointing at the Delphix Engine itself, so that queries are rejected immediately, I get an exit 0 after just 15 seconds, with no failures. In both cases nlockmgr does in fact start, but the difference suggests that some kind of timeout is involved when the service falls into maintenance.
At first, we could reliably reproduce this, and that allowed us to determine the root cause: an unreachable DNS server causes statd to hang for a while on boot doing reverse DNS lookups. The only fix, therefore, is to modify the nlockmgr timeout so that it will patiently wait for statd to finish booting.
Unfortunately, around this time, we lost the ability to reproduce the bug; we suspect it takes subtle interactions during service startup to fully trigger it. As a result, we're not sure exactly how long the timeout needs to be for nlockmgr to start up successfully. So the timeout needs to be tunable: we can raise the default, but it has to stay adjustable in case it needs to be raised further on a case-by-case basis.
That's what the fix does: we add a loop in the service startup script that waits until statd is advertising an RPC address. By increasing the service timeout, we can change how long we're willing to let nlockmgr wait for statd to finish startup.
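For illustration, here is a rough sketch of what such a wait loop could look like. This is not the actual change. The function names wait_for_statd and statd_registered are made up for this example, and the probe assumes that rpcinfo accepts "-T transport host prognum versnum" on the system in question. 100024 is the standard RPC program number for the status monitor. The sketch also carries its own explicit deadline so that it can be run on its own, whereas the fix described above relies on the service timeout to bound the wait.

```shell
#!/bin/sh
# Hypothetical sketch of a "wait for statd" loop; names are illustrative only.

STATD_PROG=100024	# RPC program number of the status monitor (statd)

# Default probe: succeeds once statd answers for version 1 over TCP.
# Assumption: rpcinfo supports "-T transport host prognum versnum" here.
statd_registered() {
	rpcinfo -T tcp localhost "$STATD_PROG" 1 >/dev/null 2>&1
}

# wait_for_statd MAX_SECONDS [PROBE_CMD]
# Polls PROBE_CMD once per second. Returns 0 as soon as the probe succeeds,
# or 1 if MAX_SECONDS elapse first. The explicit deadline is an addition for
# this sketch; a method script could instead loop until SMF times it out.
wait_for_statd() {
	max_wait=$1
	probe=${2:-statd_registered}
	waited=0
	until "$probe"; do
		if [ "$waited" -ge "$max_wait" ]; then
			return 1
		fi
		sleep 1
		waited=$((waited + 1))
	done
	return 0
}
```

A start method could call something like "wait_for_statd 300 || exit 1" before going on to start the lock manager. Passing the probe in as an argument is only there so the loop can be exercised without a running rpcbind.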