Project

General

Profile

Actions

Bug #16069

open

kcf can return handle to deregistered providers

Added by Jason King 11 days ago. Updated 11 days ago.

Status:
New
Priority:
Normal
Assignee:
Category:
kernel
Start date:
Due date:
% Done:

0%

Estimated time:
Difficulty:
Medium
Tags:
Gerrit CR:
External Bug:

Description

While working on updating the TPM driver to support TPM2.0 modules, as part of the the testing, I added support for registering the TPM as a RNG provider (largely based off the existing code in the driver).
However, with that, the system panics.

The driver itself (as far as KCF is concerned) is fairly simple: it registers itself during attach, and deregisters itself during detach. One can observe with a DEBUG kernel that during boot, for whatever reason, the system will attach and then successfully detach the TPM driver (and then re-attach a few seconds later). I've seen this attach/detach/attach behavior with other drivers, so it doesn't appear to be something specific to the TPM driver (and regardless, the driver should be able to attach/detach w/o causing problems).

I've verified the tpm driver attaches successfully, and omitting the KCF registration/deregistration, the driver is works fine (there is still some cleanup and missing functionality before it'd be ready for review, adding the KCF support was meant as an additional way to exercise the driver for testing). That is at this point, there doesn't seem to be anything the driver is doing wrong that'd cause this.

Looking at the crash dumps that were generated, what is often seen is that the kcf prov_tab has two entries for the TPM provider. The TPM soft state address is used as the provider private data. Looking at the values of both TPM entries (along with making note of the tpm soft state address during each attach attempt), one sees that one TPM provider has the address of the current TPM instance, while the other has the address of the previously attached instance (and it's the attempt to use this one that leads to a panic since the soft state is obviously no longer valid).

In another instance, it appears the RNG system tries to use the TPM after the driver has been detached, but has not yet been reattached:

> ::prtconf -d tpm
DEVINFO          NAME

However, the RNG system is still trying to call into the TPM driver:

> ::stack
mutex_enter+0xb()
tpm20_generate_random+0x141(fffffeb1cf059d00, fffffe00f7824748, 14)
tpmrng_generate_random+0x65(fffffeb1cf059d00, 0, fffffe00f7824748, 14, fffffeb1d1ac3e80)
common_submit_request+0x3c0(fffffeb1ce9a8040, 0, fffffe00f7824648, fffffeb1d1ac3e80)
process_req_hwp+0x215(fffffeb1d1ac3e80)
kcf_submit_request+0x354(fffffeb1ce9a8040, 0, 0, fffffe00f7824648, 0)
rngprov_getbytes+0xa3(fffffe00f7824748, 14, 1)
rngprov_task+0x30(14)
taskq_d_thread+0x1ac(fffffeb1d24459f8)
thread_start+0xb()

Looking at the TPM soft state, it very much does not look like like a valid attached tpm instance:

> fffffeb1cf059d00::print tpm_t tpm_dip tpm_seq
tpm_dip = 0xfffffeb1ae1b08c0
tpm_seq = 0 (TPM_ATTACH_REGS)
> 0xfffffeb1ae1b08c0::devinfo
fffffeb1ae1b08c0 tpm, instance #0 (driver not attached)
        Hardware properties at fffffeb1ae35a810:
            name='compatible' type=string items=1
                value='pnpMSF,0'
            name='reg' type=int items=3
                value=00000000.fed40000.00005000
            name='acpi-namespace' type=string items=1
                value='\_SB_.TPM_'
            name='model' type=string items=1
                value='Generic Trusted Platform Module 2.0 Module'
Actions #1

Updated by Jason King 11 days ago

Unfortunately, the KCF system seems rather complicated where providers are references along multiple dimensions, with a number of per-CPU locks, as well as other locks, and I've not yet found any comments for guidance on how they're all supposed to interact).

However a few things that seem of potential interest:

crypto_deregister_provider() basically sets the provider state to KCF_PROVIDER_UNREGISTERING, then proceeds to wait until any outstanding requests complete to then set KCF_PROVIDER_UNREGISTERED. If /dev/crypto has used the provider, it does a deferred cleanup. A reasonable assumption is that once the provider is in the unregistering state, consumers should not attempt to use it. Notably, this state appears to be set while pd_lock is held. At kcf_spi.c:538, it explicitly acquires pd_lock for the sole purpose of changing pd_state -- a strong suggestion that pd_state should be protected by pd_lock.

Notably though, kcf_get_mech_provider does not appear to acquire pd_lock when checking the state of a provider. While it seems acceptable to not acquire the lock to quickly dismiss any unregistering or unregistered providers (once in those 'unregistering' states, it doesn't seem like a provider should revert back to a non-unregistering state), the KCF_IS_PROV_USABLE macro (which examines pd_state), never appears to be used while pd_lock is held in kcf_get_mech_provider.

kcf_get_mech_provider never seems to acquire pd_lock at all. This seems like this is likely at least part of the problem, however given the complexity here, I'm not yet sure if it's the entire problem. kcf_get_mech_provider does acquire a per-cpu lock that seems to be related to the mechanisms somehow but is not the per-cpu locks that each provider has that seem to be related to queueing requests. I've not yet been able to figure out yet what these lock's relation to pd_lock (e.g. lock order, etc) is.

On top of all of this, there is also a per-cpu reference count protected by per-cpu mutexes.

It feels like maybe kcf_get_mech_provider should acquire pd_lock to verify the status of the provider, and take a reference while holding the lock, but it's not clear (yet) if that's a safe order to acquire the mutexes or not.

Actions

Also available in: Atom PDF