Bug #13326
opennscd stops returning results from LDAP until it is restarted (GSSAPI)
0%
Description
I have multiple hosts with computer account keytabs that use ldapclient to get passwd/group information from Active Directory (well, technically samba). When using the sasl/GSSAPI authentication method, nscd occasionally gets into a bad state where it stops returning any results from LDAP until the daemon is restarted.
This problem manifests with the following log messages:
nscd[99594]: [ID 293258 user.warning] libsldap: Status: 7 Mesg: LDAP ERROR (81): Can't contact LDAP server. nscd[99594]: [ID 702911 auth.notice] GSSAPI Error: The referenced context has expired (Unknown error)
At first I suspected that the host credential cache was stale, but this is not the case. Commands like ldaplist passwd
work fine, and klist
shows an up-to-date TGT and service ticket for the LDAP server. Restarting nscd via svcadm restart name-service-cache
always fixes the issue, but it often recurs after a few days.
When nscd gets into this state, getent passwd
returns instantly with all the entries from /etc/password, but the LDAP server is not even queried. When attaching to nscd with truss
or watching TCP traffic with snoop
, no connections to the LDAP server are attempted, even after invalidating the caches with nscd -i
. When attempting passwd lookups, you can watch the "negative misses" counter in the output of nscd -g
accumulate with each query.
Whenever this happens, netstat shows that the nscd process has a stale TCP socket to the LDAP server stuck in the CLOSE_WAIT state. This socket stays in CLOSE_WAIT until nscd is manually restarted. The LDAP server shows no open socket to the client when in this state, so I suspect that the server closed the connection, but ldapclient gets into some state where the descriptor is never cleaned up at the application level.
The following shows the stack of nscd while it's in this state. I can easily troubleshoot further, as this happens to at least one of my hosts per day.
99594: /usr/sbin/nscd --------------------- thread# 1 / lwp# 1 --------------------- feecca35 pause () 08059a93 main (8047d7c, fef48648) + 936 08058d57 _start_crt (1, 8047de4, fefd00e7, 0, 0, 0) + 96 08058c2a _start (1, 8047ecc, 0, 8047edb, 8047ef3, 8047efa) + 1a ------------- thread# 2 / lwp# 2 [set_smf_state] ------------- feecbef5 nanosleep (fe3bef98, fe3bef90) feeb7ab5 sleep (78) + 31 0806868b set_smf_state (0) + 3e feec7e31 _thrp_setup (fe2b0240) + 81 feec80e0 _lwp_start (fe2b0240, 0, 0, 0, 0, 0) --------------------- thread# 3 / lwp# 3 --------------------- feecd8f6 door (0, 0, 0, 0, 0, 8) feeb248d door_unref_func (1850a) + 4e feec7e31 _thrp_setup (fe2b0a40) + 81 feec80e0 _lwp_start (fe2b0a40, 0, 0, 0, 0, 0) ------------ thread# 4 / lwp# 4 [server_tsd_bind] ------------ feecd93e door (fe1abcb0, da, 0, fe1afe00, f5f00, a) 0806e54e lookup (fe1afd50, b0) + 144 0806f42c switcher (deadbeed, fe1afd50, b0, 0, 0) + 151 feecd95b __door_return () + 4b ---------------- thread# 5 / lwp# 5 [rts_mon] ---------------- feeccc65 read (4, fe0b0674, 94c) 0806e284 rts_mon () + 4a feec7e31 _thrp_setup (fe2b1a40) + 81 feec80e0 _lwp_start (fe2b1a40, 0, 0, 0, 0, 0) ------------ thread# 6 / lwp# 6 [server_tsd_bind] ------------ feecd93e door (fdf2dc80, cb, 0, fdfb1e00, f5f00, a) 0806e54e lookup (fdfb1d40, c0) + 144 0806f42c switcher (deadbeed, fdfb1d40, c0, 0, 0) + 151 feecd95b __door_return () + 4b ------------ thread# 7 / lwp# 7 [forker_monitor] ------------- feecd6f8 door (5, fde9ef48, 0, 0, 0, 3) feeb7cbf door_call (5, fde9ef48) + 82 080711ef _nscd_doorcall_fd (5, 6e000051, 0, 0, 0, 0, 0) + d9 0806be0f selfcred_pulse (5) + 3b 0806c95e forker_monitor (0) + 2e feec7e31 _thrp_setup (fe2b3240) + 81 feec80e0 _lwp_start (fe2b3240, 0, 0, 0, 0, 0) -------------- thread# 8 / lwp# 8 [revalidate] --------------- feecbef5 nanosleep (fdd9ff78, fdd9ff70) feeb7ab5 sleep (c8) + 31 0805bf87 revalidate (8155ac8) + 96 feec7e31 _thrp_setup (fe2b3a40) + 81 feec80e0 _lwp_start (fe2b3a40, 0, 0, 0, 0, 0) ---------------- thread# 9 / lwp# 9 [reaper] ----------------- feecbef5 nanosleep (fdca0f78, fdca0f70) feeb7ab5 sleep (12d) + 31 0805c904 reaper (8155ac8) + 1b9 feec7e31 _thrp_setup (fe2b4240) + 81 feec80e0 _lwp_start (fe2b4240, 0, 0, 0, 0, 0) ------------- thread# 10 / lwp# 10 [revalidate] -------------- feecbef5 nanosleep (fdba1f78, fdba1f70) feeb7ab5 sleep (c8) + 31 0805bf87 revalidate (8155948) + 96 feec7e31 _thrp_setup (fe2b4a40) + 81 feec80e0 _lwp_start (fe2b4a40, 0, 0, 0, 0, 0) --------------- thread# 11 / lwp# 11 [reaper] ---------------- feecbef5 nanosleep (fdaa2f78, fdaa2f70) feeb7ab5 sleep (12c) + 31 0805c920 reaper (8155948) + 1d5 feec7e31 _thrp_setup (fe2b5240) + 81 feec80e0 _lwp_start (fe2b5240, 0, 0, 0, 0, 0) -------------------- thread# 12 / lwp# 12 -------------------- feec8139 lwp_park (0, fd9a3f28, 0) feec1ec7 cond_wait_queue (feaa97f8, feaa9818, fd9a3f28) + 5f feec2224 cond_wait_common (feaa97f8, feaa9818, fd9a3f28) + 2ed feec22e2 __cond_timedwait (feaa97f8, feaa9818, fd9a3fa0) + 66 feec23bc cond_timedwait (feaa97f8, feaa9818, fd9a3fa0) + 35 fea65618 umem_update_thread (0) + 1ff feec7e31 _thrp_setup (fe2b5a40) + 81 feec80e0 _lwp_start (fe2b5a40, 0, 0, 0, 0, 0) -------------------- thread# 13 / lwp# 13 -------------------- feecd6f8 door (6, fd5baf18, 0, 0, 0, 3) feeb7cbf door_call (6, fd5baf18) + 82 fec8525b __ns_ldap_trydoorcall_send (fd5bcfac, fd5bcfa8, fd5bcfa4) + 4c fec90fc6 get_server_change (819aea0) + 118 feec7e31 _thrp_setup (fe2b6240) + 81 feec80e0 _lwp_start (fe2b6240, 0, 0, 0, 0, 0) ------------- thread# 19 / lwp# 19 [revalidate] -------------- feecbef5 nanosleep (fd3c0f78, fd3c0f70) feeb7ab5 sleep (258) + 31 0805bf24 revalidate (8182e08) + 33 feec7e31 _thrp_setup (fe2b7240) + 81 feec80e0 _lwp_start (fe2b7240, 0, 0, 0, 0, 0) --------------- thread# 20 / lwp# 20 [reaper] ---------------- feecbef5 nanosleep (fd4bff78, fd4bff70) feeb7ab5 sleep (259) + 31 0805c904 reaper (8182e08) + 1b9 feec7e31 _thrp_setup (fe2b6a40) + 81 feec80e0 _lwp_start (fe2b6a40, 0, 0, 0, 0, 0) ------------- thread# 21 / lwp# 21 [revalidate] -------------- feecbef5 nanosleep (fd2c1f78, fd2c1f70) feeb7ab5 sleep (258) + 31 0805bf24 revalidate (81551c8) + 33 feec7e31 _thrp_setup (fe2b7a40) + 81 feec80e0 _lwp_start (fe2b7a40, 0, 0, 0, 0, 0) --------------- thread# 22 / lwp# 22 [reaper] ---------------- feecbef5 nanosleep (fd1c2f78, fd1c2f70) feeb7ab5 sleep (258) + 31 0805c920 reaper (81551c8) + 1d5 feec7e31 _thrp_setup (fe2b8240) + 81 feec80e0 _lwp_start (fe2b8240, 0, 0, 0, 0, 0) ------------- thread# 23 / lwp# 23 [revalidate] -------------- feecbef5 nanosleep (fd0c3f78, fd0c3f70) feeb7ab5 sleep (258) + 31 0805bf24 revalidate (8182208) + 33 feec7e31 _thrp_setup (fe2b8a40) + 81 feec80e0 _lwp_start (fe2b8a40, 0, 0, 0, 0, 0) --------------- thread# 24 / lwp# 24 [reaper] ---------------- feecbef5 nanosleep (fcfc4f78, fcfc4f70) feeb7ab5 sleep (258) + 31 0805c920 reaper (8182208) + 1d5 feec7e31 _thrp_setup (fe2b9240) + 81 feec80e0 _lwp_start (fe2b9240, 0, 0, 0, 0, 0) ------------- thread# 25 / lwp# 25 [revalidate] -------------- feecbef5 nanosleep (fcec5f78, fcec5f70) feeb7ab5 sleep (258) + 31 0805bf24 revalidate (8155348) + 33 feec7e31 _thrp_setup (fe2b9a40) + 81 feec80e0 _lwp_start (fe2b9a40, 0, 0, 0, 0, 0) --------------- thread# 26 / lwp# 26 [reaper] ---------------- feecbef5 nanosleep (fcdc6f78, fcdc6f70) feeb7ab5 sleep (258) + 31 0805c920 reaper (8155348) + 1d5 feec7e31 _thrp_setup (fe2ba240) + 81 feec80e0 _lwp_start (fe2ba240, 0, 0, 0, 0, 0) ------------- thread# 67 / lwp# 67 [revalidate] -------------- feecbef5 nanosleep (fccc7f78, fccc7f70) feeb7ab5 sleep (258) + 31 0805bf24 revalidate (8155648) + 33 feec7e31 _thrp_setup (fe2baa40) + 81 feec80e0 _lwp_start (fe2baa40, 0, 0, 0, 0, 0) --------------- thread# 68 / lwp# 68 [reaper] ---------------- feecbef5 nanosleep (fcbc8f78, fcbc8f70) feeb7ab5 sleep (259) + 31 0805c904 reaper (8155648) + 1b9 feec7e31 _thrp_setup (fe2bb240) + 81 feec80e0 _lwp_start (fe2bb240, 0, 0, 0, 0, 0) ----------- thread# 69 / lwp# 69 [server_tsd_bind] ----------- feecd93e door (fca3ac60, 10e, 0, fcabee00, f5f00, a) 0806e54e lookup (fcabed18, e8) + 144 0806f42c switcher (deadbeed, fcabed18, e8, 0, 0) + 151 feecd95b __door_return () + 4b -------- thread# 196 / lwp# 196 [reclaim_getent_ctx] --------- feecbef5 nanosleep (fc9bff58, fc9bff50) feeb7ab5 sleep (3c) + 31 08063f6b reclaim_getent_ctx (0) + 1b6 feec7e31 _thrp_setup (fe2bc240) + 81 feec80e0 _lwp_start (fe2bc240, 0, 0, 0, 0, 0) ------------ thread# 197 / lwp# 197 [revalidate] ------------- feecbef5 nanosleep (fc8aef78, fc8aef70) feeb7ab5 sleep (258) + 31 0805bf24 revalidate (8182688) + 33 feec7e31 _thrp_setup (fe2bca40) + 81 feec80e0 _lwp_start (fe2bca40, 0, 0, 0, 0, 0) -------------- thread# 198 / lwp# 198 [reaper] --------------- feecbef5 nanosleep (fc7aff78, fc7aff70) feeb7ab5 sleep (258) + 31 0805c920 reaper (8182688) + 1d5 feec7e31 _thrp_setup (fe2bd240) + 81 feec80e0 _lwp_start (fe2bd240, 0, 0, 0, 0, 0)
Updated by Cullum Smith over 2 years ago
This is actually not related to GSSAPI authentication. I'm now reproducing it with anonymous authentication as well. I can reproduce on at least a few hosts (but not all) just by bouncing the LDAP server.
I can reliably reproduce this issue on SmartOS 20200925T164355Z and OmniOS r151036.
For now I disabled nscd across my environment, and there have been no further connectivity issues.
Updated by Cullum Smith over 2 years ago
Actually, I think this is related to a customization I had done previously:
svccfg -s svc:/system/name-service-cache setprop config/enable_per_user_lookup=boolean: false
After reverting this setting to its default (true), I can't reproduce this by simply restarting the LDAP server.
Perhaps this was the root issue. I'll leave this in place for a few days and see if the issue recurs.
Updated by Cullum Smith over 2 years ago
Unfortunately this is back to happening regularly. It seems as though nscd can't handle the ldap server dropping the connection. TCP connections remain in TIME_WAIT forever until the daemon is bounced.
Updated by Cullum Smith over 2 years ago
I did some additional research here. I believe the root of the problem is that, when the TCP connection to an LDAP server is reset, nscd sometimes gets stuck in a weird state where it never picks up the reconnection.
Careful examination of some truss output of nscd indicates that this function is returning -1 with LDAP_SERVER_DOWN:
/*
* the logic here is:
* if
* 1. no connections exists,
* or
* 2. if the connection is either not in the connected
* or connecting state in an async io model
* or
* 3. the connection is notin a connected state with normal (non async io)
*/
if ( lc == NULL
|| ( (ld->ld_options & LDAP_BITOPT_ASYNC
&& lc->lconn_status != LDAP_CONNST_CONNECTING
&& lc->lconn_status != LDAP_CONNST_CONNECTED)
|| (!(ld->ld_options & LDAP_BITOPT_ASYNC )
&& lc->lconn_status != LDAP_CONNST_CONNECTED) ) ) {
ber_free( ber, 1 );
if ( lc != NULL ) {
LDAP_SET_LDERRNO( ld, LDAP_SERVER_DOWN, NULL, NULL );
}
if ( incparent ) {
/* Forget about the bind */
--parentreq->lr_outrefcnt;
}
LDAP_MUTEX_UNLOCK( ld, LDAP_CONN_LOCK );
return( -1 );
}
When this is occuring, /usr/lib/ldap/ldap_cachemgr -g
always shows the LDAP servers as "UP". Restarting name-server-cache is always sufficient to solve the problem. So to summarize, it seems the sequence of events is:
1. ldapclient (and nsswitch.conf) are configured to use one or more LDAP servers, along with nscd.
2. At some point, the LDAP servers reboot, resetting the TCP connection(s).
3. Once the nscd cache expires, getent passwd
stops returning any LDAP results. No new connections are even attempted. "Can't contact LDAP server" is spammed in the system log. ldaplist passwd
continues to work fine.
4. Either restarting (or just disabling) nscd causes LDAP passwd lookups to immediately start working again.
I'm guessing something asynchronous in nscd's libsldap state machine is just not "seeing" the new LDAP connection state? I've worked around this by just disabling nscd, but this obviously isn't ideal.
Updated by Cullum Smith over 2 years ago
It took hours of debugging, and while I don't have a root cause, I narrowed down the trigger to a seemingly innocuous sequence of commands in my config management scripts (as well as a reliable (so far) workaround).
To recap, after running my config management script (which sets up the ldapclient, dns, etc), all my Illumos hosts (both SmartOS and OmniOS) seem to be left in a vulnerable state in which nscd cannot recover from a blip in connectivity to the LDAP servers. Restarting nscd resolves the issue, and any subsequent LDAP blips after this will recover without issue. Re-running my setup script will always put a server back into this "vulnerable" state.
I found that the following sequence of commands is sufficient to recreate the problem:
# Set a very low cache timeout for nscd to make debugging easier. in /etc/nscd.conf:
# positive-time-to-live group 5
# positive-time-to-live passwd 5
# 1. enable nscd
svcadm enable name-service-cache
# 2. (re-)configure the ldapclient
ldapclient manual -a my_args...
# 3. Run a full group lookup for a user. `id -u` is not sufficient to reproduce.
id root
# 4. Then, bounce/restart the remote LDAP server(s).
# The TCP connection(s) to the LDAP server(s) on the illumos box will be reset.
# 5. You'll find that passwd lookups for remote users start failing, and never recover.
getent passwd someuser # fails
# 6. Restart nscd to resolve the problem:
svcadm restart name-service-cache
getent passwd someuser # works
This is the very weird part: if you never run a full group lookup (step 3), you can't reproduce! Things like getent passwd myuser
, or id -u myuser
are not sufficient to put nscd into this vulnerable state.
So, as a fix, I put this at the end of my setup script:
# dummy group lookup to workaround illumos bug 13326
id root > /dev/null
restart_service system/name-service-cache
And even more weird, the order of these two commands is also critical. If you restart nscd before running the dummy group lookup, it's still vulnerable to the perma-disconnect if the LDAP server ever drops the connection. You have to restart nscd after.
I suspect I only hit this weird corner case because I was re-running the "ldapclient manual" initialization command repeatedly as part of my idempotent(ish) configuration management script. Most people probably do not do this.
Anyway, maybe this workaround will help someone else!
Updated by Cullum Smith over 2 years ago
Ha - it turns out the bug immediately re-manifests once a non-root user logs in over ssh. I think I've done all the troubleshooting I can do on this - as far as I can tell, nscd just does not handle LDAP disconnects properly.
Updated by Jorge Schrauwen over 1 year ago
Interesting read, I am experiencing something similar were nscd stops showing ldap results of the severs are rebooted, I’ll check for the TIME_WAIT state next time it happens.
Updated by Jorge Schrauwen over 1 year ago
I did not find the connection in this state, I found none at all.
But restarting, or well disabling name-service-cache does bringt back user and group lookups