Project

General

Profile

Bug #9823

Deadlock in ACPI Method Evaluation

Added by Robert Mustacchi about 1 year ago. Updated 9 months ago.

Status:
Closed
Priority:
Normal
Category:
kernel
Start date:
2018-09-09
Due date:
% Done:

100%

Estimated time:
Difficulty:
Hard
Tags:

Description

We've seen on several different systems and boots (https://github.com/joyent/smartos-live/issues/727) cases where we appear to hang in ACPI. I reproduced this on an Intel NUC NUC7i3BNH.

To first diagnose this, I tried to get a few more verbose items. On one of the attempts I would turn on moddebug and set it to include MODDEBUG_ERRMSG (0x40000000). Through this I saw that we were loading an ACPI related table, or at least attempting an override, as one of the last things that we executed. With this in mind, I decided to look closer at the ACPI code. By default the ACPI code on a debug build will log various informational messages; however, only to the system log. Unfortunately, given that we were hanging in boot and not able to use the local keyboard / mouse attached to the NUC in kmdb, that left us in a bit of a bad situation.

To work around this, I first changed the call to strlog() in acpica_pr_buf() to no longer have the flag SL_LOGONLY. This allowed us to make forward progress and see what was going on. Based on this I was able to observe that we were trying to execute a method as the last thing that we could see. To get to the next layer, I had found a message that we had loaded an OEM table ("Dynamic OEM Table Load:", which meant that we had either hit AcpiExLoadOp or AcpiExLoadTableOp.

From here, I iterated a bit more and started enable more advanced ACPI debugging when we hit these issues. Specifically, I would do something like ::bp -c 'AcpiDbgLevel/W 0xf007ff5f; :c' -d acpica`AcpiExLoadOp. With this, I got an additional lead and went back and enabled ACPI's function level debugging. I was able to determine that we were in AcpiNsEvaluate, which makes sense because we were trying to evaluate a method, and that the last ACPI function we executed was AcpiExEnterInterpreter and AcpiUtAcquireMutex. This was interesting, we didn't return out of these. I immediately suspected a deadlock was occurring.

I rebuilt the acpica module adding the macro ACPI_MUTEX_DEBUG and lo and behold, we no longer hung. This is of course because this would cause the attempt to grab the lock to fail before we deadlocked and the fact that it failed was ignored entirely. This made it clear that we were actually triggering a deadlock here. I went back and modified the caller of this, AcpiExEnterInterpreter to panic if we failed to get the lock and started rigging up a few dcmds to run by setting a breakpoint on panic with a command to execute (recall that USB was DOA at this point). Based on that, I was able to put together the following stack trace (note this was manually coppied, so offsets were dropped):

AcpiNsEvaluate
AcpiNsExecModuleCode
AcpiNsExecModuleCodeList
AcpiNsLoadTable
AcpiExAddTable
AcpiExLoadOp
AcpiExOpcode_1A_IT-OR
AcpiDsExecEndOp
AcpiPsParseLoop
AcpiPsParseAml
AcpiPsExecuteMethod
AcpiNsEvaluate
AcpiEvaluateObject
cpu_acpi_write_pdc
cpupm_intel_init
cpupm_init
post_startup
main
locore_start

The bulk of this effort has involved testing on a wide variety of systems to ensure that we don't have regressions. We've tested on a wide variety of platforms which are wrapped up in the table below. Importantly we've tested most generations of Intel processors and one or two AMD ones. This covers everything from the Intel Core 2 Duo to Coffee Lake.

Location System MFGN Processor Mobo Notes
Joyent Haswell SMCI 1U INTC E3-1270v3 X10SLM+-LN4F -
Joyent Ivy SMCI 1U E3-1270v2 X9SCL/X9SCM -
Joyent sky1 (Joyent-M12G5) SMCI 2U Gold 6132 X11DPH-T -
Joyent magma (Joyent-S10G5) SMCI 4U Gold 6132 X11DPH-T -
Joyent lava (SSG-2028R-ACR24L) SMCI 2U E5-2690v4 X10DRH-iT -
Joyent estel (NUC7i3BNH) Intel NUC (SKL) Core i3-7100U NUC7i3BNB Now boots!
Joyent NVMe (SYS-2028U-E1CNRT+) SMCI 2U E5-2620v4 X10DRU-i+ -
Joyent HP DL360p g8 HP 1U CPU E5-2680 0 @ 2.70GHz - Table warnings always existed
Joyent odyssey SMCI 2U EPYC 7601 - -
Joyent vmware fusion 8.5.10 Apple Core(tm) i7-3635QM - -
joyent/smartos-live#727 X1 Carbon Lenovo ? ? Now Boots!
joyent/smartos-live#727 - - Intel core i5 8600K (Coffee Lake) Asrock z370 extreme4 Now boots!
joyent/smartos-live#727 - - i3-6100T Asus H110S2 Now Boots!
joyent/smartos-live#727 - - i3-6100T MSI C236M Now Boots!
joyent/smartos-live#727 DL320 G6 HP X5650 - -
joyent/smartos-live#727 ML110 G7 HP E3-1220 - -
joyent/smartos-live#727 - SMCI e3-1220v2 x9scm-f -
smartos-discuss - SMCI Gold 6140 X11DPU -
Individual T630 Dell Intel Xeon E5-2630 v3 - -
Individual - - Intel Core i7-3970X DX79TO Some new fw mismatch warnings
Joyent Volcano (R730) Dell 2690v4 - -
Individual Precision T1600 Dell XEon E3-1270 - -
Individual - Asrock Atom C2550 C2550D4I -
Individual Terca M10 Toshiba Core 2 Duo P8600 - -
IRC - Supermicro X11SSH-CTf Kaby Lake Now Boots
joyent/smartos-live#727 - SuperMicro X11SRA-RF Xeon W-2145 Skylake W Now boots
joyent/smartos-live#727 NUC 7i5BNH Intel Intel Core i5 7260U - Now Boots

With this, everything became a lot clearer. As part of evaluating the ACPI CPU power management features we had reached this point where the method caused us to need to execute and load another table. However, the locking inside the internal ACPI code base was incorrect which lead us to already have the lock that we were trying to drop.

The fix for this is a bit involved, but thankfully has already been addressed in the upstream ACPICA logic. As such, what probably makes the most sense is updating our ACPICA code. With this update in place, we have successfully been able to boot and execute on the kaby lake processor.


Related issues

Related to illumos gate - Feature #9824: Update ACPI to joyent/20180629Closed2018-09-09

Actions

History

#1

Updated by Robert Mustacchi about 1 year ago

#2

Updated by Electric Monk 9 months ago

  • Status changed from New to Closed

git commit 35786f6866ae52207d0f1a25fe7ca5f652f32ce0

commit  35786f6866ae52207d0f1a25fe7ca5f652f32ce0
Author: Robert Mustacchi <rm@joyent.com>
Date:   2019-01-10T20:11:18.000Z

    9823 Deadlock in ACPI Method Evaluation
    9824 Update ACPI to joyent/20180629
    Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
    Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
    Reviewed by: Andy Fiddaman <af@citrus-it.net>
    Approved by: Richard Lowe <richlowe@richlowe.net>

Also available in: Atom PDF