apix module mistakingly sets TPR
(Paraphrasing Robert's email) When spl is raised within a spinlock during a cross-call, on systems with x2apics (mostly Sandy Bridge), it can cause a hard hang.
To determine if a system is susceptible, run mdb -k and issue 'setspl/p'. It 'apix_setspl' appears in the output, then the system is vulnerable. If apic_setspl is seen, then the system is not.
Workaround is either add 'set apix_enable = 0' to /etc/system, or via mdb -kd, 'apix_enable/W 0' before the system boots.
Updated by Robert Mustacchi almost 9 years ago
- Subject changed from System hangs in splr() within a spinlock during cross-calls on x2apic systems to apix modules mistakingly sets TPR
- % Done changed from 0 to 90
- Tags deleted (
The task priority register (TPR) register on x86 is a part of the local apic and x2apic where by you can set a value ranging from 0-15 which is used to mask interrupts equal to or below that level. That register is then compared to the current dispatched interrupt and the combination of them determines the processor priority register (PPR) which dictates whether or not an interrupt of a specific level is masked or not. This is combined with the bit in the flags register that is controlled by the sti and cli instructions which determine whether or not the processor will receive an interrupt.
To exercise this we wrote a small psuedo driver which would try and do spin locks during cross calls at a high pil. If you boot with kmdb enabled and set a breakpoint rather early in
apic_nmi_intr(), you can successfully break into kmdb with an nmi and look at the associated state of the apix.
> *apic_cpus,0t32::print -at apic_cpus_info_t aci_curipl ffffff23125c6813 uchar_t aci_curipl = 0xf ffffff23125c6853 uchar_t aci_curipl = 0xf ffffff23125c6893 uchar_t aci_curipl = 0xf ffffff23125c68d3 uchar_t aci_curipl = 0xf ffffff23125c6913 uchar_t aci_curipl = 0xf ffffff23125c6953 uchar_t aci_curipl = 0xf ffffff23125c6993 uchar_t aci_curipl = 0xf ffffff23125c69d3 uchar_t aci_curipl = 0xf ffffff23125c6a13 uchar_t aci_curipl = 0xf ffffff23125c6a53 uchar_t aci_curipl = 0xf ffffff23125c6a93 uchar_t aci_curipl = 0xf ffffff23125c6ad3 uchar_t aci_curipl = 0xf ffffff23125c6b13 uchar_t aci_curipl = 0xf ffffff23125c6b53 uchar_t aci_curipl = 0xf ffffff23125c6b93 uchar_t aci_curipl = 0xf ffffff23125c6bd3 uchar_t aci_curipl = 0xf ffffff23125c6c13 uchar_t aci_curipl = 0xf ffffff23125c6c53 uchar_t aci_curipl = 0xf ffffff23125c6c93 uchar_t aci_curipl = 0xf ffffff23125c6cd3 uchar_t aci_curipl = 0xf ffffff23125c6d13 uchar_t aci_curipl = 0xf ffffff23125c6d53 uchar_t aci_curipl = 0xf ffffff23125c6d93 uchar_t aci_curipl = 0xf ffffff23125c6dd3 uchar_t aci_curipl = 0xf ffffff23125c6e13 uchar_t aci_curipl = 0xf ffffff23125c6e53 uchar_t aci_curipl = 0xf ffffff23125c6e93 uchar_t aci_curipl = 0xf ffffff23125c6ed3 uchar_t aci_curipl = 0xf ffffff23125c6f13 uchar_t aci_curipl = 0xf ffffff23125c6f53 uchar_t aci_curipl = 0xf ffffff23125c6f93 uchar_t aci_curipl = 0xf ffffff23125c6fd3 uchar_t aci_curipl = 0xf
From here, we can actually track back how this happens. Recall that apix code setspl is a bit odd. In particular it has the following little bit:
763 /* 764 * Mask all interrupts for XC_HI_PIL (i.e set TPR to 0xf). 765 * Otherwise, enable all interrupts (i.e. set TPR to 0). 766 */ 767 if (ipl != XC_HI_PIL) 768 ipl = 0; 769
Furthermore, the apix implementations of setlvl and setlvlx do not even attempt or pretend to call setspl. pcplusmp does, for example, but in its own way.
Now we have recently grabbed a spin lock during a cross call. That means that we will have a splr and splx at the appropriate level. splr sets the new priority level and returns the current priority level as based upon the metadata that we have about what IPL we think we are at.
The setlvl and setlvlx functions do manipulate the values that we use to determine what ipl we are currently at (e.g. we don't read the registers in question, we just use our metadata in the CPU and apic/apix data structures: cpu_infop->aci_curipl).
So we come in, grab the spin lock, which causes us to call splr 15. We explicitly set spl to 15 and return what we think the current one is: 15. When we are done, we call splx which sets the spl back to 15 and now we return to our normal interrupt handling path. However, here's the problem, there's nothing here that will go back and manipulate the spl which means we can leave the interrupt handler with the spl at 15!
Now at this point, we have a chance of getting a get out of jail free card. If we end up processing softints and we lose our interrupt thread, then we will switch out and splhigh to twelve thus being safe. However, if we don't and return through the interrupt gate, then we can end up calling i86_wait with the TPR still set to 15 and thus basically kill ourselves.
Based on this we went and evaluated the apix driver implementation and this is Jerry's follow up.
This code does not use the TPR to mask interrupts, except in the case of the *spl (apix_setspl & x2apix_setspl) function, which is the root cause of the bug. The code is designed to allow interrupts (potentially lower-level) to come in while we are processing interrupts. Since we don't use the TPR, we depend on the IF Flag to block interrupts. At certain points we use sti/cli to allow other interrupts to come in. We either handle those or treat them as pending interrupts which are handled later.
So to reiterate the bug, since *setspl is using the TPR, if we do the spin lock while we are at IPL 15, we mask ints using the TPR, but reset the level back to 15 so we never unmask the ints in the TPR. Since no other code uses the TPR, we never clear back down.
The comments in *setspl are incorrect. We only ever are using the TPR to either mask all ints or none of them.
So, I think the proposed fix is to never use the TPR, including in the *setspl code. We should just be consistent and use the IF flag to either mask or pass all interrupts.