kcf can return handle to deregistered providers
While working on updating the TPM driver to support TPM2.0 modules, as part of the the testing, I added support for registering the TPM as a RNG provider (largely based off the existing code in the driver).
However, with that, the system panics.
The driver itself (as far as KCF is concerned) is fairly simple: it registers itself during attach, and deregisters itself during detach. One can observe with a DEBUG kernel that during boot, for whatever reason, the system will attach and then successfully detach the TPM driver (and then re-attach a few seconds later). I've seen this attach/detach/attach behavior with other drivers, so it doesn't appear to be something specific to the TPM driver (and regardless, the driver should be able to attach/detach w/o causing problems).
I've verified the tpm driver attaches successfully, and omitting the KCF registration/deregistration, the driver is works fine (there is still some cleanup and missing functionality before it'd be ready for review, adding the KCF support was meant as an additional way to exercise the driver for testing). That is at this point, there doesn't seem to be anything the driver is doing wrong that'd cause this.
Looking at the crash dumps that were generated, what is often seen is that the kcf
prov_tab has two entries for the TPM provider. The TPM soft state address is used as the provider private data. Looking at the values of both TPM entries (along with making note of the tpm soft state address during each attach attempt), one sees that one TPM provider has the address of the current TPM instance, while the other has the address of the previously attached instance (and it's the attempt to use this one that leads to a panic since the soft state is obviously no longer valid).
In another instance, it appears the RNG system tries to use the TPM after the driver has been detached, but has not yet been reattached:
> ::prtconf -d tpm DEVINFO NAME
However, the RNG system is still trying to call into the TPM driver:
> ::stack mutex_enter+0xb() tpm20_generate_random+0x141(fffffeb1cf059d00, fffffe00f7824748, 14) tpmrng_generate_random+0x65(fffffeb1cf059d00, 0, fffffe00f7824748, 14, fffffeb1d1ac3e80) common_submit_request+0x3c0(fffffeb1ce9a8040, 0, fffffe00f7824648, fffffeb1d1ac3e80) process_req_hwp+0x215(fffffeb1d1ac3e80) kcf_submit_request+0x354(fffffeb1ce9a8040, 0, 0, fffffe00f7824648, 0) rngprov_getbytes+0xa3(fffffe00f7824748, 14, 1) rngprov_task+0x30(14) taskq_d_thread+0x1ac(fffffeb1d24459f8) thread_start+0xb()
Looking at the TPM soft state, it very much does not look like like a valid attached tpm instance:
> fffffeb1cf059d00::print tpm_t tpm_dip tpm_seq tpm_dip = 0xfffffeb1ae1b08c0 tpm_seq = 0 (TPM_ATTACH_REGS) > 0xfffffeb1ae1b08c0::devinfo fffffeb1ae1b08c0 tpm, instance #0 (driver not attached) Hardware properties at fffffeb1ae35a810: name='compatible' type=string items=1 value='pnpMSF,0' name='reg' type=int items=3 value=00000000.fed40000.00005000 name='acpi-namespace' type=string items=1 value='\_SB_.TPM_' name='model' type=string items=1 value='Generic Trusted Platform Module 2.0 Module'
Updated by Jason King 11 days ago
Unfortunately, the KCF system seems rather complicated where providers are references along multiple dimensions, with a number of per-CPU locks, as well as other locks, and I've not yet found any comments for guidance on how they're all supposed to interact).
However a few things that seem of potential interest:
crypto_deregister_provider() basically sets the provider state to
KCF_PROVIDER_UNREGISTERING, then proceeds to wait until any outstanding requests complete to then set
KCF_PROVIDER_UNREGISTERED. If /dev/crypto has used the provider, it does a deferred cleanup. A reasonable assumption is that once the provider is in the unregistering state, consumers should not attempt to use it. Notably, this state appears to be set while
pd_lock is held. At kcf_spi.c:538, it explicitly acquires pd_lock for the sole purpose of changing
pd_state -- a strong suggestion that
pd_state should be protected by
kcf_get_mech_provider does not appear to acquire
pd_lock when checking the state of a provider. While it seems acceptable to not acquire the lock to quickly dismiss any unregistering or unregistered providers (once in those 'unregistering' states, it doesn't seem like a provider should revert back to a non-unregistering state), the
KCF_IS_PROV_USABLE macro (which examines
pd_state), never appears to be used while
pd_lock is held in
kcf_get_mech_provider never seems to acquire
pd_lock at all. This seems like this is likely at least part of the problem, however given the complexity here, I'm not yet sure if it's the entire problem.
kcf_get_mech_provider does acquire a per-cpu lock that seems to be related to the mechanisms somehow but is not the per-cpu locks that each provider has that seem to be related to queueing requests. I've not yet been able to figure out yet what these lock's relation to
pd_lock (e.g. lock order, etc) is.
On top of all of this, there is also a per-cpu reference count protected by per-cpu mutexes.
It feels like maybe
kcf_get_mech_provider should acquire
pd_lock to verify the status of the provider, and take a reference while holding the lock, but it's not clear (yet) if that's a safe order to acquire the mutexes or not.