Bug #15024


NFS can exhaust pool threads getting RPCSEC_GSS credentials

Added by Matt Barden 2 months ago. Updated about 2 months ago.

In Progress
nfs - NFS server and client


Today, RPCSEC_GSS authentication is broken into two phases:

1. When we first get a NULL procedure, we call kgss_accept_sec_context(), which checks whether the ticket is valid. If this succeeds, authentication succeeds, and the client can start sending NFSv4 requests.
2. The first time we get a non-NULL procedure, and every few hours afterwards, we call kgsscred_expname_to_unix_cred(), which maps the principal name acquired from the ticket in step 1 to a Unix account.

If step 2 fails, no uid and gid are cached; instead the user is mapped to the 'nobody' user, and the client continues as normal. Since not all NFS requests access an export, some of these requests can still succeed. However, the failed result isn't cached, which means that every time we get a request on such a context, we call down into gssd to try to get a Unix account again. This is a problem, because doing this unsuccessfully can take a very long time (for example, if the realm or KDC needs to be discovered via DNS, and DNS is broken or slow; I've seen it take over 5 seconds). Worse, gssd is single-threaded, which means every thread in the NFS pool can be stuck waiting on gssd, just to slowly discover that it can't actually do anything.
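The effect of not caching the failure can be illustrated with a small user-space model. This is only a sketch of the behaviour described above, not the kernel code: every name in it (gss_ctx_model_t, model_expname_to_unix_cred(), MODEL_UID_NOBODY and so on) is hypothetical, the uid values are placeholders, and the periodic refresh "every few hours" is left out. It just counts how many simulated gssd round trips a run of requests costs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model; none of these names come from the illumos source. */
typedef struct gss_ctx_model {
	bool		cred_cached;	/* true once a mapping has succeeded */
	unsigned	uid;		/* valid only if cred_cached is true */
} gss_ctx_model_t;

#define	MODEL_UID_NOBODY	60001u	/* placeholder uid for 'nobody' */
#define	MODEL_UID_USER		1000u	/* placeholder uid for a mapped user */

static unsigned long upcalls;	/* simulated round trips to gssd */

/* Stand-in for the slow upcall; fails if the principal can't be mapped. */
static bool
model_expname_to_unix_cred(bool mappable, unsigned *uidp)
{
	upcalls++;
	if (!mappable)
		return (false);
	*uidp = MODEL_UID_USER;
	return (true);
}

/* Per-request lookup as described above: only a success is cached. */
static unsigned
model_request_uid(gss_ctx_model_t *ctx, bool mappable)
{
	if (ctx->cred_cached)
		return (ctx->uid);
	if (model_expname_to_unix_cred(mappable, &ctx->uid)) {
		ctx->cred_cached = true;
		return (ctx->uid);
	}
	return (MODEL_UID_NOBODY);	/* the failure is not remembered */
}

/* Send nreq requests on a fresh context; return the upcalls they cost. */
static unsigned long
model_upcalls_for(size_t nreq, bool mappable)
{
	gss_ctx_model_t ctx = { false, 0 };
	unsigned long before = upcalls;

	for (size_t i = 0; i < nreq; i++)
		(void) model_request_uid(&ctx, mappable);
	assert(upcalls >= before);
	return (upcalls - before);
}
```

In this model a mappable principal costs one upcall no matter how many requests follow, while an unmappable one costs one upcall per request. With a single-threaded gssd and failures that take seconds each, those per-request upcalls are what queue up the NFS pool threads.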

We can address this by moving the 'map the principal to a unix cred' code to do_gss_accept(), which runs in a separate taskq. As part of this, the user's mapping is no longer periodically updated (or we could end up back in the same position); to pick up a changed mapping, the GSS context instead needs to be destroyed and re-created (such as by re-mounting the share).
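A minimal user-space sketch of the proposed shape, under stated assumptions: the mapping is done once when the context is accepted, and request handling only reads the stored result. All names here (fixed_ctx_model_t, model_accept(), model_map_principal() and so on) are hypothetical and the uid values are placeholders. The sketch also assumes that a failed mapping leaves the context as 'nobody' for its lifetime; the actual change may handle failure differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model; none of these names come from the illumos source. */
typedef struct fixed_ctx_model {
	unsigned	uid;	/* decided once, at context establishment */
} fixed_ctx_model_t;

#define	MODEL_UID_NOBODY	60001u	/* placeholder uid for 'nobody' */
#define	MODEL_UID_USER		1000u	/* placeholder uid for a mapped user */

static unsigned long fixed_upcalls;	/* simulated round trips to gssd */

/* Stand-in for the slow principal-to-account upcall. */
static bool
model_map_principal(bool mappable, unsigned *uidp)
{
	fixed_upcalls++;
	if (!mappable)
		return (false);
	*uidp = MODEL_UID_USER;
	return (true);
}

/*
 * Stand-in for the accept path. In the proposal this work happens in a
 * separate taskq, so it is kept away from the request path below.
 * Assumption: a failed mapping falls back to 'nobody' for the whole
 * lifetime of the context.
 */
static void
model_accept(fixed_ctx_model_t *ctx, bool mappable)
{
	if (!model_map_principal(mappable, &ctx->uid))
		ctx->uid = MODEL_UID_NOBODY;
}

/* Request path: no upcall, just the result stored at accept time. */
static unsigned
model_fixed_request_uid(const fixed_ctx_model_t *ctx)
{
	return (ctx->uid);
}

/* Accept one context, send nreq requests, return the upcalls it cost. */
static unsigned long
model_fixed_upcalls_for(size_t nreq, bool mappable)
{
	fixed_ctx_model_t ctx = { 0 };
	unsigned long before = fixed_upcalls;

	model_accept(&ctx, mappable);
	for (size_t i = 0; i < nreq; i++)
		(void) model_fixed_request_uid(&ctx);
	assert(fixed_upcalls >= before);
	return (fixed_upcalls - before);
}
```

Here the cost is one upcall per context whether or not the principal can be mapped, which is also why a changed mapping is only seen after the context is re-created.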

Tested using the utility in #15022 to generate many contexts at the same time, as well as by testing that krb5 nfs mounts continue to work (and identify users correctly).

Actions #1

Updated by Electric Monk 2 months ago

  • Gerrit CR set to 2402
Actions #2

Updated by Adam Stylinski about 2 months ago

I believe I've actually been bitten by this one. When it happens, even login on the local console is basically impossible.

