Project

General

Profile

Actions

Bug #13326

open

nscd stops returning results from LDAP until it is restarted (GSSAPI)

Added by Cullum Smith 10 months ago. Updated 9 months ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
cmd - userland programs
Start date:
Due date:
% Done:

0%

Estimated time:
Difficulty:
Medium
Tags:
Gerrit CR:

Description

I have multiple hosts with computer account keytabs that use ldapclient to get passwd/group information from Active Directory (well, technically samba). When using the sasl/GSSAPI authentication method, nscd occasionally gets into a bad state where it stops returning any results from LDAP until the daemon is restarted.

This problem manifests with the following log messages:

nscd[99594]: [ID 293258 user.warning] libsldap: Status: 7  Mesg: LDAP ERROR (81): Can't contact LDAP server.
nscd[99594]: [ID 702911 auth.notice] GSSAPI Error: The referenced context has expired (Unknown error)

At first I suspected that the host credential cache was stale, but this is not the case. Commands like ldaplist passwd work fine, and klist shows an up-to-date TGT and service ticket for the LDAP server. Restarting nscd via svcadm restart name-service-cache always fixes the issue, but it often recurs after a few days.

When nscd gets into this state, getent passwd returns instantly with all the entries from /etc/password, but the LDAP server is not even queried. When attaching to nscd with truss or watching TCP traffic with snoop, no connections to the LDAP server are attempted, even after invalidating the caches with nscd -i. When attempting passwd lookups, you can watch the "negative misses" counter in the output of nscd -g accumulate with each query.

Whenever this happens, netstat shows that the nscd process has a stale TCP socket to the LDAP server stuck in the CLOSE_WAIT state. This socket stays in CLOSE_WAIT until nscd is manually restarted. The LDAP server shows no open socket to the client when in this state, so I suspect that the server closed the connection, but ldapclient gets into some state where the descriptor is never cleaned up at the application level.

The following shows the stack of nscd while it's in this state. I can easily troubleshoot further, as this happens to at least one of my hosts per day.

99594:  /usr/sbin/nscd
--------------------- thread# 1 / lwp# 1 ---------------------
 feecca35 pause    ()
 08059a93 main     (8047d7c, fef48648) + 936
 08058d57 _start_crt (1, 8047de4, fefd00e7, 0, 0, 0) + 96
 08058c2a _start   (1, 8047ecc, 0, 8047edb, 8047ef3, 8047efa) + 1a
------------- thread# 2 / lwp# 2 [set_smf_state] -------------
 feecbef5 nanosleep (fe3bef98, fe3bef90)
 feeb7ab5 sleep    (78) + 31
 0806868b set_smf_state (0) + 3e
 feec7e31 _thrp_setup (fe2b0240) + 81
 feec80e0 _lwp_start (fe2b0240, 0, 0, 0, 0, 0)
--------------------- thread# 3 / lwp# 3 ---------------------
 feecd8f6 door     (0, 0, 0, 0, 0, 8)
 feeb248d door_unref_func (1850a) + 4e
 feec7e31 _thrp_setup (fe2b0a40) + 81
 feec80e0 _lwp_start (fe2b0a40, 0, 0, 0, 0, 0)
------------ thread# 4 / lwp# 4 [server_tsd_bind] ------------
 feecd93e door     (fe1abcb0, da, 0, fe1afe00, f5f00, a)
 0806e54e lookup   (fe1afd50, b0) + 144
 0806f42c switcher (deadbeed, fe1afd50, b0, 0, 0) + 151
 feecd95b __door_return () + 4b
---------------- thread# 5 / lwp# 5 [rts_mon] ----------------
 feeccc65 read     (4, fe0b0674, 94c)
 0806e284 rts_mon  () + 4a
 feec7e31 _thrp_setup (fe2b1a40) + 81
 feec80e0 _lwp_start (fe2b1a40, 0, 0, 0, 0, 0)
------------ thread# 6 / lwp# 6 [server_tsd_bind] ------------
 feecd93e door     (fdf2dc80, cb, 0, fdfb1e00, f5f00, a)
 0806e54e lookup   (fdfb1d40, c0) + 144
 0806f42c switcher (deadbeed, fdfb1d40, c0, 0, 0) + 151
 feecd95b __door_return () + 4b
------------ thread# 7 / lwp# 7 [forker_monitor] -------------
 feecd6f8 door     (5, fde9ef48, 0, 0, 0, 3)
 feeb7cbf door_call (5, fde9ef48) + 82
 080711ef _nscd_doorcall_fd (5, 6e000051, 0, 0, 0, 0, 0) + d9
 0806be0f selfcred_pulse (5) + 3b
 0806c95e forker_monitor (0) + 2e
 feec7e31 _thrp_setup (fe2b3240) + 81
 feec80e0 _lwp_start (fe2b3240, 0, 0, 0, 0, 0)
-------------- thread# 8 / lwp# 8 [revalidate] ---------------
 feecbef5 nanosleep (fdd9ff78, fdd9ff70)
 feeb7ab5 sleep    (c8) + 31
 0805bf87 revalidate (8155ac8) + 96
 feec7e31 _thrp_setup (fe2b3a40) + 81
 feec80e0 _lwp_start (fe2b3a40, 0, 0, 0, 0, 0)
---------------- thread# 9 / lwp# 9 [reaper] -----------------
 feecbef5 nanosleep (fdca0f78, fdca0f70)
 feeb7ab5 sleep    (12d) + 31
 0805c904 reaper   (8155ac8) + 1b9
 feec7e31 _thrp_setup (fe2b4240) + 81
 feec80e0 _lwp_start (fe2b4240, 0, 0, 0, 0, 0)
------------- thread# 10 / lwp# 10 [revalidate] --------------
 feecbef5 nanosleep (fdba1f78, fdba1f70)
 feeb7ab5 sleep    (c8) + 31
 0805bf87 revalidate (8155948) + 96
 feec7e31 _thrp_setup (fe2b4a40) + 81
 feec80e0 _lwp_start (fe2b4a40, 0, 0, 0, 0, 0)
--------------- thread# 11 / lwp# 11 [reaper] ----------------
 feecbef5 nanosleep (fdaa2f78, fdaa2f70)
 feeb7ab5 sleep    (12c) + 31
 0805c920 reaper   (8155948) + 1d5
 feec7e31 _thrp_setup (fe2b5240) + 81
 feec80e0 _lwp_start (fe2b5240, 0, 0, 0, 0, 0)
-------------------- thread# 12 / lwp# 12 --------------------
 feec8139 lwp_park (0, fd9a3f28, 0)
 feec1ec7 cond_wait_queue (feaa97f8, feaa9818, fd9a3f28) + 5f
 feec2224 cond_wait_common (feaa97f8, feaa9818, fd9a3f28) + 2ed
 feec22e2 __cond_timedwait (feaa97f8, feaa9818, fd9a3fa0) + 66
 feec23bc cond_timedwait (feaa97f8, feaa9818, fd9a3fa0) + 35
 fea65618 umem_update_thread (0) + 1ff
 feec7e31 _thrp_setup (fe2b5a40) + 81
 feec80e0 _lwp_start (fe2b5a40, 0, 0, 0, 0, 0)
-------------------- thread# 13 / lwp# 13 --------------------
 feecd6f8 door     (6, fd5baf18, 0, 0, 0, 3)
 feeb7cbf door_call (6, fd5baf18) + 82
 fec8525b __ns_ldap_trydoorcall_send (fd5bcfac, fd5bcfa8, fd5bcfa4) + 4c
 fec90fc6 get_server_change (819aea0) + 118
 feec7e31 _thrp_setup (fe2b6240) + 81
 feec80e0 _lwp_start (fe2b6240, 0, 0, 0, 0, 0)
------------- thread# 19 / lwp# 19 [revalidate] --------------
 feecbef5 nanosleep (fd3c0f78, fd3c0f70)
 feeb7ab5 sleep    (258) + 31
 0805bf24 revalidate (8182e08) + 33
 feec7e31 _thrp_setup (fe2b7240) + 81
 feec80e0 _lwp_start (fe2b7240, 0, 0, 0, 0, 0)
--------------- thread# 20 / lwp# 20 [reaper] ----------------
 feecbef5 nanosleep (fd4bff78, fd4bff70)
 feeb7ab5 sleep    (259) + 31
 0805c904 reaper   (8182e08) + 1b9
 feec7e31 _thrp_setup (fe2b6a40) + 81
 feec80e0 _lwp_start (fe2b6a40, 0, 0, 0, 0, 0)
------------- thread# 21 / lwp# 21 [revalidate] --------------
 feecbef5 nanosleep (fd2c1f78, fd2c1f70)
 feeb7ab5 sleep    (258) + 31
 0805bf24 revalidate (81551c8) + 33
 feec7e31 _thrp_setup (fe2b7a40) + 81
 feec80e0 _lwp_start (fe2b7a40, 0, 0, 0, 0, 0)
--------------- thread# 22 / lwp# 22 [reaper] ----------------
 feecbef5 nanosleep (fd1c2f78, fd1c2f70)
 feeb7ab5 sleep    (258) + 31
 0805c920 reaper   (81551c8) + 1d5
 feec7e31 _thrp_setup (fe2b8240) + 81
 feec80e0 _lwp_start (fe2b8240, 0, 0, 0, 0, 0)
------------- thread# 23 / lwp# 23 [revalidate] --------------
 feecbef5 nanosleep (fd0c3f78, fd0c3f70)
 feeb7ab5 sleep    (258) + 31
 0805bf24 revalidate (8182208) + 33
 feec7e31 _thrp_setup (fe2b8a40) + 81
 feec80e0 _lwp_start (fe2b8a40, 0, 0, 0, 0, 0)
--------------- thread# 24 / lwp# 24 [reaper] ----------------
 feecbef5 nanosleep (fcfc4f78, fcfc4f70)
 feeb7ab5 sleep    (258) + 31
 0805c920 reaper   (8182208) + 1d5
 feec7e31 _thrp_setup (fe2b9240) + 81
 feec80e0 _lwp_start (fe2b9240, 0, 0, 0, 0, 0)
------------- thread# 25 / lwp# 25 [revalidate] --------------
 feecbef5 nanosleep (fcec5f78, fcec5f70)
 feeb7ab5 sleep    (258) + 31
 0805bf24 revalidate (8155348) + 33
 feec7e31 _thrp_setup (fe2b9a40) + 81
 feec80e0 _lwp_start (fe2b9a40, 0, 0, 0, 0, 0)
--------------- thread# 26 / lwp# 26 [reaper] ----------------
 feecbef5 nanosleep (fcdc6f78, fcdc6f70)
 feeb7ab5 sleep    (258) + 31
 0805c920 reaper   (8155348) + 1d5
 feec7e31 _thrp_setup (fe2ba240) + 81
 feec80e0 _lwp_start (fe2ba240, 0, 0, 0, 0, 0)
------------- thread# 67 / lwp# 67 [revalidate] --------------
 feecbef5 nanosleep (fccc7f78, fccc7f70)
 feeb7ab5 sleep    (258) + 31
 0805bf24 revalidate (8155648) + 33
 feec7e31 _thrp_setup (fe2baa40) + 81
 feec80e0 _lwp_start (fe2baa40, 0, 0, 0, 0, 0)
--------------- thread# 68 / lwp# 68 [reaper] ----------------
 feecbef5 nanosleep (fcbc8f78, fcbc8f70)
 feeb7ab5 sleep    (259) + 31
 0805c904 reaper   (8155648) + 1b9
 feec7e31 _thrp_setup (fe2bb240) + 81
 feec80e0 _lwp_start (fe2bb240, 0, 0, 0, 0, 0)
----------- thread# 69 / lwp# 69 [server_tsd_bind] -----------
 feecd93e door     (fca3ac60, 10e, 0, fcabee00, f5f00, a)
 0806e54e lookup   (fcabed18, e8) + 144
 0806f42c switcher (deadbeed, fcabed18, e8, 0, 0) + 151
 feecd95b __door_return () + 4b
-------- thread# 196 / lwp# 196 [reclaim_getent_ctx] ---------
 feecbef5 nanosleep (fc9bff58, fc9bff50)
 feeb7ab5 sleep    (3c) + 31
 08063f6b reclaim_getent_ctx (0) + 1b6
 feec7e31 _thrp_setup (fe2bc240) + 81
 feec80e0 _lwp_start (fe2bc240, 0, 0, 0, 0, 0)
------------ thread# 197 / lwp# 197 [revalidate] -------------
 feecbef5 nanosleep (fc8aef78, fc8aef70)
 feeb7ab5 sleep    (258) + 31
 0805bf24 revalidate (8182688) + 33
 feec7e31 _thrp_setup (fe2bca40) + 81
 feec80e0 _lwp_start (fe2bca40, 0, 0, 0, 0, 0)
-------------- thread# 198 / lwp# 198 [reaper] ---------------
 feecbef5 nanosleep (fc7aff78, fc7aff70)
 feeb7ab5 sleep    (258) + 31
 0805c920 reaper   (8182688) + 1d5
 feec7e31 _thrp_setup (fe2bd240) + 81
 feec80e0 _lwp_start (fe2bd240, 0, 0, 0, 0, 0)
Actions #1

Updated by Cullum Smith 10 months ago

This is actually not related to GSSAPI authentication. I'm now reproducing it with anonymous authentication as well. I can reproduce on at least a few hosts (but not all) just by bouncing the LDAP server.

I can reliably reproduce this issue on SmartOS 20200925T164355Z and OmniOS r151036.

For now I disabled nscd across my environment, and there have been no further connectivity issues.

Actions #2

Updated by Cullum Smith 10 months ago

Actually, I think this is related to a customization I had done previously:

svccfg -s svc:/system/name-service-cache setprop config/enable_per_user_lookup=boolean: false

After reverting this setting to its default (true), I can't reproduce this by simply restarting the LDAP server.

Perhaps this was the root issue. I'll leave this in place for a few days and see if the issue recurs.

Actions #3

Updated by Cullum Smith 9 months ago

Unfortunately this is back to happening regularly. It seems as though nscd can't handle the ldap server dropping the connection. TCP connections remain in TIME_WAIT forever until the daemon is bounced.

Actions #4

Updated by Cullum Smith 9 months ago

I did some additional research here. I believe the root of the problem is that, when the TCP connection to an LDAP server is reset, nscd sometimes gets stuck in a weird state where it never picks up the reconnection.

Careful examination of some truss output of nscd indicates that this function is returning -1 with LDAP_SERVER_DOWN:

/*
     * the logic here is:
     * if 
     * 1. no connections exists, 
     * or 
     * 2. if the connection is either not in the connected 
     *     or connecting state in an async io model
     * or 
     * 3. the connection is notin a connected state with normal (non async io)
     */
    if (   lc == NULL
        || (  (ld->ld_options & LDAP_BITOPT_ASYNC 
               && lc->lconn_status != LDAP_CONNST_CONNECTING
            && lc->lconn_status != LDAP_CONNST_CONNECTED)
              || (!(ld->ld_options & LDAP_BITOPT_ASYNC )
            && lc->lconn_status != LDAP_CONNST_CONNECTED) ) ) {

        ber_free( ber, 1 );
        if ( lc != NULL ) {
            LDAP_SET_LDERRNO( ld, LDAP_SERVER_DOWN, NULL, NULL );
        }
        if ( incparent ) {
            /* Forget about the bind */
            --parentreq->lr_outrefcnt; 
        }
        LDAP_MUTEX_UNLOCK( ld, LDAP_CONN_LOCK );
        return( -1 );
    }

When this is occuring, /usr/lib/ldap/ldap_cachemgr -g always shows the LDAP servers as "UP". Restarting name-server-cache is always sufficient to solve the problem. So to summarize, it seems the sequence of events is:

1. ldapclient (and nsswitch.conf) are configured to use one or more LDAP servers, along with nscd.
2. At some point, the LDAP servers reboot, resetting the TCP connection(s).
3. Once the nscd cache expires, getent passwd stops returning any LDAP results. No new connections are even attempted. "Can't contact LDAP server" is spammed in the system log. ldaplist passwd continues to work fine.
4. Either restarting (or just disabling) nscd causes LDAP passwd lookups to immediately start working again.

I'm guessing something asynchronous in nscd's libsldap state machine is just not "seeing" the new LDAP connection state? I've worked around this by just disabling nscd, but this obviously isn't ideal.

Actions #5

Updated by Cullum Smith 9 months ago

It took hours of debugging, and while I don't have a root cause, I narrowed down the trigger to a seemingly innocuous sequence of commands in my config management scripts (as well as a reliable (so far) workaround).

To recap, after running my config management script (which sets up the ldapclient, dns, etc), all my Illumos hosts (both SmartOS and OmniOS) seem to be left in a vulnerable state in which nscd cannot recover from a blip in connectivity to the LDAP servers. Restarting nscd resolves the issue, and any subsequent LDAP blips after this will recover without issue. Re-running my setup script will always put a server back into this "vulnerable" state.

I found that the following sequence of commands is sufficient to recreate the problem:

# Set a very low cache timeout for nscd to make debugging easier. in /etc/nscd.conf:
#   positive-time-to-live  group  5
#   positive-time-to-live  passwd  5

# 1. enable nscd
svcadm enable name-service-cache

# 2. (re-)configure the ldapclient
ldapclient manual -a my_args...

# 3. Run a full group lookup for a user. `id -u` is not sufficient to reproduce.
id root

# 4. Then, bounce/restart the remote LDAP server(s).
#     The TCP connection(s) to the LDAP server(s) on the illumos box will be reset.

# 5. You'll find that passwd lookups for remote users start failing, and never recover.
getent passwd someuser # fails

# 6. Restart nscd to resolve the problem:
svcadm restart name-service-cache
getent passwd someuser # works

This is the very weird part: if you never run a full group lookup (step 3), you can't reproduce! Things like getent passwd myuser, or id -u myuser are not sufficient to put nscd into this vulnerable state.

So, as a fix, I put this at the end of my setup script:

# dummy group lookup to workaround illumos bug 13326
id root > /dev/null
restart_service system/name-service-cache

And even more weird, the order of these two commands is also critical. If you restart nscd before running the dummy group lookup, it's still vulnerable to the perma-disconnect if the LDAP server ever drops the connection. You have to restart nscd after.

I suspect I only hit this weird corner case because I was re-running the "ldapclient manual" initialization command repeatedly as part of my idempotent(ish) configuration management script. Most people probably do not do this.

Anyway, maybe this workaround will help someone else!

Actions #6

Updated by Cullum Smith 9 months ago

Ha - it turns out the bug immediately re-manifests once a non-root user logs in over ssh. I think I've done all the troubleshooting I can do on this - as far as I can tell, nscd just does not handle LDAP disconnects properly.

Actions

Also available in: Atom PDF