Deadlock in ACPI Method Evaluation
We've seen on several different systems and boots (https://github.com/joyent/smartos-live/issues/727) cases where we appear to hang in ACPI. I reproduced this on an Intel NUC NUC7i3BNH.
To first diagnose this, I tried to get a few more verbose items. On one of the attempts I would turn on moddebug and set it to include MODDEBUG_ERRMSG (0x40000000). Through this I saw that we were loading an ACPI related table, or at least attempting an override, as one of the last things that we executed. With this in mind, I decided to look closer at the ACPI code. By default the ACPI code on a debug build will log various informational messages; however, only to the system log. Unfortunately, given that we were hanging in boot and not able to use the local keyboard / mouse attached to the NUC in kmdb, that left us in a bit of a bad situation.
To work around this, I first changed the call to strlog() in acpica_pr_buf() to no longer have the flag SL_LOGONLY. This allowed us to make forward progress and see what was going on. Based on this I was able to observe that we were trying to execute a method as the last thing that we could see. To get to the next layer, I had found a message that we had loaded an OEM table ("Dynamic OEM Table Load:", which meant that we had either hit AcpiExLoadOp or AcpiExLoadTableOp.
From here, I iterated a bit more and started enable more advanced ACPI debugging when we hit these issues. Specifically, I would do something like ::bp -c 'AcpiDbgLevel/W 0xf007ff5f; :c' -d acpica`AcpiExLoadOp. With this, I got an additional lead and went back and enabled ACPI's function level debugging. I was able to determine that we were in AcpiNsEvaluate, which makes sense because we were trying to evaluate a method, and that the last ACPI function we executed was AcpiExEnterInterpreter and AcpiUtAcquireMutex. This was interesting, we didn't return out of these. I immediately suspected a deadlock was occurring.
I rebuilt the acpica module adding the macro ACPI_MUTEX_DEBUG and lo and behold, we no longer hung. This is of course because this would cause the attempt to grab the lock to fail before we deadlocked and the fact that it failed was ignored entirely. This made it clear that we were actually triggering a deadlock here. I went back and modified the caller of this, AcpiExEnterInterpreter to panic if we failed to get the lock and started rigging up a few dcmds to run by setting a breakpoint on panic with a command to execute (recall that USB was DOA at this point). Based on that, I was able to put together the following stack trace (note this was manually coppied, so offsets were dropped):
AcpiNsEvaluate AcpiNsExecModuleCode AcpiNsExecModuleCodeList AcpiNsLoadTable AcpiExAddTable AcpiExLoadOp AcpiExOpcode_1A_IT-OR AcpiDsExecEndOp AcpiPsParseLoop AcpiPsParseAml AcpiPsExecuteMethod AcpiNsEvaluate AcpiEvaluateObject cpu_acpi_write_pdc cpupm_intel_init cpupm_init post_startup main locore_start
The bulk of this effort has involved testing on a wide variety of systems to ensure that we don't have regressions. We've tested on a wide variety of platforms which are wrapped up in the table below. Importantly we've tested most generations of Intel processors and one or two AMD ones. This covers everything from the Intel Core 2 Duo to Coffee Lake.
|Joyent||Haswell||SMCI 1U||INTC E3-1270v3||X10SLM+-LN4F||-|
|Joyent||sky1 (Joyent-M12G5)||SMCI 2U||Gold 6132||X11DPH-T||-|
|Joyent||magma (Joyent-S10G5)||SMCI 4U||Gold 6132||X11DPH-T||-|
|Joyent||lava (SSG-2028R-ACR24L)||SMCI 2U||E5-2690v4||X10DRH-iT||-|
|Joyent||estel (NUC7i3BNH)||Intel NUC (SKL)||Core i3-7100U||NUC7i3BNB||Now boots!|
|Joyent||NVMe (SYS-2028U-E1CNRT+)||SMCI 2U||E5-2620v4||X10DRU-i+||-|
|Joyent||HP DL360p g8||HP 1U||CPU E5-2680 0 @ 2.70GHz||-||Table warnings always existed|
|Joyent||odyssey||SMCI 2U||EPYC 7601||-||-|
|Joyent||vmware fusion 8.5.10||Apple||Core(tm) i7-3635QM||-||-|
|joyent/smartos-live#727||X1 Carbon||Lenovo||?||?||Now Boots!|
|joyent/smartos-live#727||-||-||Intel core i5 8600K (Coffee Lake)||Asrock z370 extreme4||Now boots!|
|joyent/smartos-live#727||-||-||i3-6100T||Asus H110S2||Now Boots!|
|joyent/smartos-live#727||-||-||i3-6100T||MSI C236M||Now Boots!|
|Individual||T630||Dell||Intel Xeon E5-2630 v3||-||-|
|Individual||-||-||Intel Core i7-3970X||DX79TO||Some new fw mismatch warnings|
|Individual||Precision T1600||Dell||XEon E3-1270||-||-|
|Individual||Terca M10||Toshiba||Core 2 Duo P8600||-||-|
|IRC||-||Supermicro||X11SSH-CTf||Kaby Lake||Now Boots|
|joyent/smartos-live#727||-||SuperMicro||X11SRA-RF||Xeon W-2145 Skylake W||Now boots|
|joyent/smartos-live#727||NUC 7i5BNH||Intel||Intel Core i5 7260U||-||Now Boots|
With this, everything became a lot clearer. As part of evaluating the ACPI CPU power management features we had reached this point where the method caused us to need to execute and load another table. However, the locking inside the internal ACPI code base was incorrect which lead us to already have the lock that we were trying to drop.
The fix for this is a bit involved, but thankfully has already been addressed in the upstream ACPICA logic. As such, what probably makes the most sense is updating our ACPICA code. With this update in place, we have successfully been able to boot and execute on the kaby lake processor.
Updated by Electric Monk over 1 year ago
- Status changed from New to Closed
commit 35786f6866ae52207d0f1a25fe7ca5f652f32ce0 Author: Robert Mustacchi <email@example.com> Date: 2019-01-10T20:11:18.000Z 9823 Deadlock in ACPI Method Evaluation 9824 Update ACPI to joyent/20180629 Reviewed by: Patrick Mooney <firstname.lastname@example.org> Reviewed by: Jerry Jelinek <email@example.com> Reviewed by: Andy Fiddaman <firstname.lastname@example.org> Approved by: Richard Lowe <email@example.com>