
Bug #13326

nscd stops returning results from LDAP until it is restarted (GSSAPI)

Added by Cullum Smith 9 days ago. Updated 6 days ago.

Status: New
Priority: Normal
Assignee: -
Category: cmd - userland programs
Start date: -
Due date: -
% Done: 0%
Estimated time: -
Difficulty: Medium
Tags: -
Gerrit CR: -

Description

I have multiple hosts with computer account keytabs that use ldapclient to get passwd/group information from Active Directory (well, technically samba). When using the sasl/GSSAPI authentication method, nscd occasionally gets into a bad state where it stops returning any results from LDAP until the daemon is restarted.
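For context, the client configuration is along these lines (the search base and server name are placeholders, not my real values):

ldapclient manual \
    -a credentialLevel=self \
    -a authenticationMethod=sasl/GSSAPI \
    -a defaultSearchBase=dc=example,dc=com \
    -a defaultServerList=dc1.example.com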

This problem manifests with the following log messages:

nscd[99594]: [ID 293258 user.warning] libsldap: Status: 7  Mesg: LDAP ERROR (81): Can't contact LDAP server.
nscd[99594]: [ID 702911 auth.notice] GSSAPI Error: The referenced context has expired (Unknown error)

At first I suspected that the host credential cache was stale, but this is not the case. Commands like ldaplist passwd work fine, and klist shows an up-to-date TGT and service ticket for the LDAP server. Restarting nscd via svcadm restart name-service-cache always fixes the issue, but it often recurs after a few days.
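Concretely, the checks that still pass while nscd is wedged look like this (the user name is illustrative):

klist                      # shows a current TGT plus an ldap/ service ticket for the server
ldaplist passwd someuser   # goes through libsldap directly and returns the entry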

When nscd gets into this state, getent passwd returns instantly with only the entries from /etc/passwd; the LDAP server is never queried. Attaching to nscd with truss and watching TCP traffic with snoop both show that no connections to the LDAP server are attempted, even after invalidating the caches with nscd -i. During passwd lookups, you can watch the "negative misses" counter in the output of nscd -g climb with each query.
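The observations above came from commands along these lines (PID, interface, and server name are illustrative):

nscd -i passwd                        # invalidate the passwd cache
truss -t so_socket,connect -p 99594   # no socket syscalls appear during lookups
snoop -d net0 host dc1.example.com    # no packets to the LDAP server either
nscd -g | grep -i miss                # negative misses climb with each query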

Whenever this happens, netstat shows that the nscd process has a stale TCP connection to the LDAP server stuck in the CLOSE_WAIT state, and it stays there until nscd is manually restarted. The LDAP server has no corresponding open socket in this state, so I suspect that the server closed the connection, but the client-side library (libsldap) never cleans up the descriptor at the application level.
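A quick way to spot the stale descriptor (port notation and PID are illustrative):

netstat -an | grep \.389   # one connection to the server stuck in CLOSE_WAIT
pfiles 99594               # maps that CLOSE_WAIT socket to an open fd in nscd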

The following shows the stack of nscd while it's in this state. I can easily gather more debugging information, as this happens on at least one of my hosts every day.
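The capture command was simply:

pstack 99594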

99594:  /usr/sbin/nscd
--------------------- thread# 1 / lwp# 1 ---------------------
 feecca35 pause    ()
 08059a93 main     (8047d7c, fef48648) + 936
 08058d57 _start_crt (1, 8047de4, fefd00e7, 0, 0, 0) + 96
 08058c2a _start   (1, 8047ecc, 0, 8047edb, 8047ef3, 8047efa) + 1a
------------- thread# 2 / lwp# 2 [set_smf_state] -------------
 feecbef5 nanosleep (fe3bef98, fe3bef90)
 feeb7ab5 sleep    (78) + 31
 0806868b set_smf_state (0) + 3e
 feec7e31 _thrp_setup (fe2b0240) + 81
 feec80e0 _lwp_start (fe2b0240, 0, 0, 0, 0, 0)
--------------------- thread# 3 / lwp# 3 ---------------------
 feecd8f6 door     (0, 0, 0, 0, 0, 8)
 feeb248d door_unref_func (1850a) + 4e
 feec7e31 _thrp_setup (fe2b0a40) + 81
 feec80e0 _lwp_start (fe2b0a40, 0, 0, 0, 0, 0)
------------ thread# 4 / lwp# 4 [server_tsd_bind] ------------
 feecd93e door     (fe1abcb0, da, 0, fe1afe00, f5f00, a)
 0806e54e lookup   (fe1afd50, b0) + 144
 0806f42c switcher (deadbeed, fe1afd50, b0, 0, 0) + 151
 feecd95b __door_return () + 4b
---------------- thread# 5 / lwp# 5 [rts_mon] ----------------
 feeccc65 read     (4, fe0b0674, 94c)
 0806e284 rts_mon  () + 4a
 feec7e31 _thrp_setup (fe2b1a40) + 81
 feec80e0 _lwp_start (fe2b1a40, 0, 0, 0, 0, 0)
------------ thread# 6 / lwp# 6 [server_tsd_bind] ------------
 feecd93e door     (fdf2dc80, cb, 0, fdfb1e00, f5f00, a)
 0806e54e lookup   (fdfb1d40, c0) + 144
 0806f42c switcher (deadbeed, fdfb1d40, c0, 0, 0) + 151
 feecd95b __door_return () + 4b
------------ thread# 7 / lwp# 7 [forker_monitor] -------------
 feecd6f8 door     (5, fde9ef48, 0, 0, 0, 3)
 feeb7cbf door_call (5, fde9ef48) + 82
 080711ef _nscd_doorcall_fd (5, 6e000051, 0, 0, 0, 0, 0) + d9
 0806be0f selfcred_pulse (5) + 3b
 0806c95e forker_monitor (0) + 2e
 feec7e31 _thrp_setup (fe2b3240) + 81
 feec80e0 _lwp_start (fe2b3240, 0, 0, 0, 0, 0)
-------------- thread# 8 / lwp# 8 [revalidate] ---------------
 feecbef5 nanosleep (fdd9ff78, fdd9ff70)
 feeb7ab5 sleep    (c8) + 31
 0805bf87 revalidate (8155ac8) + 96
 feec7e31 _thrp_setup (fe2b3a40) + 81
 feec80e0 _lwp_start (fe2b3a40, 0, 0, 0, 0, 0)
---------------- thread# 9 / lwp# 9 [reaper] -----------------
 feecbef5 nanosleep (fdca0f78, fdca0f70)
 feeb7ab5 sleep    (12d) + 31
 0805c904 reaper   (8155ac8) + 1b9
 feec7e31 _thrp_setup (fe2b4240) + 81
 feec80e0 _lwp_start (fe2b4240, 0, 0, 0, 0, 0)
------------- thread# 10 / lwp# 10 [revalidate] --------------
 feecbef5 nanosleep (fdba1f78, fdba1f70)
 feeb7ab5 sleep    (c8) + 31
 0805bf87 revalidate (8155948) + 96
 feec7e31 _thrp_setup (fe2b4a40) + 81
 feec80e0 _lwp_start (fe2b4a40, 0, 0, 0, 0, 0)
--------------- thread# 11 / lwp# 11 [reaper] ----------------
 feecbef5 nanosleep (fdaa2f78, fdaa2f70)
 feeb7ab5 sleep    (12c) + 31
 0805c920 reaper   (8155948) + 1d5
 feec7e31 _thrp_setup (fe2b5240) + 81
 feec80e0 _lwp_start (fe2b5240, 0, 0, 0, 0, 0)
-------------------- thread# 12 / lwp# 12 --------------------
 feec8139 lwp_park (0, fd9a3f28, 0)
 feec1ec7 cond_wait_queue (feaa97f8, feaa9818, fd9a3f28) + 5f
 feec2224 cond_wait_common (feaa97f8, feaa9818, fd9a3f28) + 2ed
 feec22e2 __cond_timedwait (feaa97f8, feaa9818, fd9a3fa0) + 66
 feec23bc cond_timedwait (feaa97f8, feaa9818, fd9a3fa0) + 35
 fea65618 umem_update_thread (0) + 1ff
 feec7e31 _thrp_setup (fe2b5a40) + 81
 feec80e0 _lwp_start (fe2b5a40, 0, 0, 0, 0, 0)
-------------------- thread# 13 / lwp# 13 --------------------
 feecd6f8 door     (6, fd5baf18, 0, 0, 0, 3)
 feeb7cbf door_call (6, fd5baf18) + 82
 fec8525b __ns_ldap_trydoorcall_send (fd5bcfac, fd5bcfa8, fd5bcfa4) + 4c
 fec90fc6 get_server_change (819aea0) + 118
 feec7e31 _thrp_setup (fe2b6240) + 81
 feec80e0 _lwp_start (fe2b6240, 0, 0, 0, 0, 0)
------------- thread# 19 / lwp# 19 [revalidate] --------------
 feecbef5 nanosleep (fd3c0f78, fd3c0f70)
 feeb7ab5 sleep    (258) + 31
 0805bf24 revalidate (8182e08) + 33
 feec7e31 _thrp_setup (fe2b7240) + 81
 feec80e0 _lwp_start (fe2b7240, 0, 0, 0, 0, 0)
--------------- thread# 20 / lwp# 20 [reaper] ----------------
 feecbef5 nanosleep (fd4bff78, fd4bff70)
 feeb7ab5 sleep    (259) + 31
 0805c904 reaper   (8182e08) + 1b9
 feec7e31 _thrp_setup (fe2b6a40) + 81
 feec80e0 _lwp_start (fe2b6a40, 0, 0, 0, 0, 0)
------------- thread# 21 / lwp# 21 [revalidate] --------------
 feecbef5 nanosleep (fd2c1f78, fd2c1f70)
 feeb7ab5 sleep    (258) + 31
 0805bf24 revalidate (81551c8) + 33
 feec7e31 _thrp_setup (fe2b7a40) + 81
 feec80e0 _lwp_start (fe2b7a40, 0, 0, 0, 0, 0)
--------------- thread# 22 / lwp# 22 [reaper] ----------------
 feecbef5 nanosleep (fd1c2f78, fd1c2f70)
 feeb7ab5 sleep    (258) + 31
 0805c920 reaper   (81551c8) + 1d5
 feec7e31 _thrp_setup (fe2b8240) + 81
 feec80e0 _lwp_start (fe2b8240, 0, 0, 0, 0, 0)
------------- thread# 23 / lwp# 23 [revalidate] --------------
 feecbef5 nanosleep (fd0c3f78, fd0c3f70)
 feeb7ab5 sleep    (258) + 31
 0805bf24 revalidate (8182208) + 33
 feec7e31 _thrp_setup (fe2b8a40) + 81
 feec80e0 _lwp_start (fe2b8a40, 0, 0, 0, 0, 0)
--------------- thread# 24 / lwp# 24 [reaper] ----------------
 feecbef5 nanosleep (fcfc4f78, fcfc4f70)
 feeb7ab5 sleep    (258) + 31
 0805c920 reaper   (8182208) + 1d5
 feec7e31 _thrp_setup (fe2b9240) + 81
 feec80e0 _lwp_start (fe2b9240, 0, 0, 0, 0, 0)
------------- thread# 25 / lwp# 25 [revalidate] --------------
 feecbef5 nanosleep (fcec5f78, fcec5f70)
 feeb7ab5 sleep    (258) + 31
 0805bf24 revalidate (8155348) + 33
 feec7e31 _thrp_setup (fe2b9a40) + 81
 feec80e0 _lwp_start (fe2b9a40, 0, 0, 0, 0, 0)
--------------- thread# 26 / lwp# 26 [reaper] ----------------
 feecbef5 nanosleep (fcdc6f78, fcdc6f70)
 feeb7ab5 sleep    (258) + 31
 0805c920 reaper   (8155348) + 1d5
 feec7e31 _thrp_setup (fe2ba240) + 81
 feec80e0 _lwp_start (fe2ba240, 0, 0, 0, 0, 0)
------------- thread# 67 / lwp# 67 [revalidate] --------------
 feecbef5 nanosleep (fccc7f78, fccc7f70)
 feeb7ab5 sleep    (258) + 31
 0805bf24 revalidate (8155648) + 33
 feec7e31 _thrp_setup (fe2baa40) + 81
 feec80e0 _lwp_start (fe2baa40, 0, 0, 0, 0, 0)
--------------- thread# 68 / lwp# 68 [reaper] ----------------
 feecbef5 nanosleep (fcbc8f78, fcbc8f70)
 feeb7ab5 sleep    (259) + 31
 0805c904 reaper   (8155648) + 1b9
 feec7e31 _thrp_setup (fe2bb240) + 81
 feec80e0 _lwp_start (fe2bb240, 0, 0, 0, 0, 0)
----------- thread# 69 / lwp# 69 [server_tsd_bind] -----------
 feecd93e door     (fca3ac60, 10e, 0, fcabee00, f5f00, a)
 0806e54e lookup   (fcabed18, e8) + 144
 0806f42c switcher (deadbeed, fcabed18, e8, 0, 0) + 151
 feecd95b __door_return () + 4b
-------- thread# 196 / lwp# 196 [reclaim_getent_ctx] ---------
 feecbef5 nanosleep (fc9bff58, fc9bff50)
 feeb7ab5 sleep    (3c) + 31
 08063f6b reclaim_getent_ctx (0) + 1b6
 feec7e31 _thrp_setup (fe2bc240) + 81
 feec80e0 _lwp_start (fe2bc240, 0, 0, 0, 0, 0)
------------ thread# 197 / lwp# 197 [revalidate] -------------
 feecbef5 nanosleep (fc8aef78, fc8aef70)
 feeb7ab5 sleep    (258) + 31
 0805bf24 revalidate (8182688) + 33
 feec7e31 _thrp_setup (fe2bca40) + 81
 feec80e0 _lwp_start (fe2bca40, 0, 0, 0, 0, 0)
-------------- thread# 198 / lwp# 198 [reaper] ---------------
 feecbef5 nanosleep (fc7aff78, fc7aff70)
 feeb7ab5 sleep    (258) + 31
 0805c920 reaper   (8182688) + 1d5
 feec7e31 _thrp_setup (fe2bd240) + 81
 feec80e0 _lwp_start (fe2bd240, 0, 0, 0, 0, 0)
#1

Updated by Cullum Smith 7 days ago

This is actually not related to GSSAPI authentication: I'm now reproducing it with anonymous authentication as well. I can trigger it on at least a few hosts (but not all) simply by bouncing the LDAP server.
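A rough reproduction sketch (the user name is a placeholder; the server-side restart command varies by platform, shown here for a Debian-packaged samba AD DC):

# on the directory server:
systemctl restart samba-ad-dc
# on the client, shortly after the restart:
getent passwd someaduser   # once nscd wedges, LDAP entries no longer appear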

I can reliably reproduce this issue on SmartOS 20200925T164355Z and OmniOS r151036.

For now I've disabled nscd across my environment, and there have been no further connectivity issues.
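That is:

svcadm disable name-service-cache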

#2

Updated by Cullum Smith 6 days ago

Actually, I think this is related to a customization I had made previously:

svccfg -s svc:/system/name-service-cache setprop config/enable_per_user_lookup=boolean: false

After reverting this setting to its default (true), I can no longer reproduce the problem simply by restarting the LDAP server.
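For anyone following along, the revert amounts to:

svccfg -s svc:/system/name-service-cache setprop config/enable_per_user_lookup=boolean: true
svcadm refresh name-service-cache
svcadm restart name-service-cache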

Perhaps this was the root cause. I'll leave the default in place for a few days and see whether the issue recurs.
