Bug #12473

closed

Missing shutdown messages on AMD Ryzen

Added by Gary Mills over 2 years ago. Updated over 2 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: XNV (X Window System)
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:
Difficulty: Medium
Tags:

Description

Since approximately December 2019, the shutdown messages have disappeared on my AMD Ryzen systems. In the tests, outlined below, I powered off the system from the GUI, and observed the screen.

In a normal shutdown, the screen displayed all the console messages, starting from the last boot. At the conclusion, it powered off the system.

For the abnormal systems, I got a black screen, with a single white rectangle in the upper left corner. After some time, the system powered off.

The last line of /var/log/Xorg.0.log.old, on systems that displayed the shutdown messages correctly, was like this:

Server terminated successfully (0). Closing log file.

This message was printed to the log file about two seconds after the previous message.

On systems that did not display the shutdown messages, this line was missing from the Xorg log file.

This behavior is quite reproducible: it happens on every AMD Ryzen system I tested with a recent version of OI.

Two of my systems behaved normally. They are:

An Intel system, for comparison. It ran OI from March 2020. The CPU was an Intel(r) Xeon(r) W3550. The video card was a Radeon HD 2400 PRO/XT, operating at 1280x1024.

An AMD system. This one had an AMD Ryzen 3 1200. The video card was a Radeon HD 7450, also operating at 1280x1024. For this test, it ran OI from September 2018.

Two of them misbehaved. They are:

The same AMD system as above, running OI from March 2020

Another AMD system. It also ran OI from March 2020. The CPU was an AMD Ryzen 3 2200G. The video card was a Radeon HD 7450, also operating at 1280x1024.

I conclude that this misbehavior only happens with AMD systems, running a recent version of OI. It seems that, on these systems, Xorg terminates early, before it is able to write the last line of the log file.

I suspect that the problem is with Xorg itself, although more debugging will be required to make a positive identification.


Files

30-xserver-term.patch (671 Bytes), Gary Mills, 2020-04-12 10:11 PM
30-xserver-term.patch (538 Bytes), Gary Mills, 2020-04-13 11:54 PM
Actions #1

Updated by Gary Mills over 2 years ago

I found a clue in the lightdm logs. Here's an example:

[+0.64s] DEBUG: Launching process 1062: /usr/bin/Xorg :0 -seat seat0 -auth /var/run/lightdm/root/:0 -nolisten tcp vt7 -novtswitch
[+21.16s] DEBUG: Got signal 16 from process 1062
[+168200.88s] DEBUG: Sending signal 15 to process 1062
[+168205.88s] DEBUG: Sending signal 9 to process 1062
[+168207.92s] DEBUG: Process 1062 terminated with signal 9

Clearly, lightdm sent the software termination signal (SIGTERM) to Xorg, and then five seconds later, sent the SIGKILL signal to the same process, causing immediate termination of Xorg. That would explain why Xorg was unable to reset the graphics device back into text mode. The wait() should occur after the first signal, not after the second one.

Actions #2

Updated by Gary Mills over 2 years ago

I have a patch for lightdm, 30-xserver-term.patch, that fixes this problem. When the package is rebuilt with this patch, /var/log/lightdm/lightdm.log.old has this content:

[+0.69s] DEBUG: Launching process 1099: /usr/bin/Xorg :0 -seat seat0 -auth /var/run/lightdm/root/:0 -nolisten tcp vt7 -novtswitch
[+21.20s] DEBUG: Got signal 16 from process 1099
[+1027.62s] DEBUG: Sending signal 15 to process 1099
[+1034.73s] DEBUG: Process 1099 exited with return value 0

It takes 6 or 7 seconds for Xorg to terminate correctly. The five-second delay built into lightdm is not quite sufficient. At least with Xorg, signal 15 is sufficient to cause it to terminate. There's no need to send signal 9. Perhaps it is needed for ill-designed xservers, but Xorg is not one of them.

Actions #3

Updated by Alexander Pyhalov over 2 years ago

I'm not sure that completely disabling SIGKILL is fine (imagine some hanging Xorg).

Perhaps, making this timeout configurable and increasing it by default (for example, to 1 minute) would be enough?

Actions #4

Updated by Gary Mills over 2 years ago

Yes, 60 seconds would be better. Has Xorg ever hung? It hasn't in my experience. SIGTERM is documented as the way to make Xorg clean up and exit.

Just making the timeout longer is simple but insufficient. The sleep would have to be interrupted when Xorg exited. Otherwise, lightdm would always sleep for the entire time. That's undesirable too. The only way to interrupt the sleep would be SIGCLD, but I expect it to be sent after the wait. I don't know of another way. As well, SIGKILL would have to be sent only if the entire sleep time was used up. I don't know how you would arrange that.

Just sending SIGTERM seems to be the best way. Everything else gets complicated. It is shutdown after all.

Actions #5

Updated by Jorge Schrauwen over 2 years ago

You could make lightdm loop 10 times with a 1-second sleep, checking each time whether Xorg has exited?
That would waste some CPU, but it might be worth the tradeoff?

Actions #6

Updated by Gary Mills over 2 years ago

Yes, that is the complex alternative. It would have to be a 5-second sleep done 12 times, though. The check could use waitpid() with WNOHANG and WNOWAIT options set. It would send SIGKILL only if the loop completed. The question to answer first is: is all of this complexity necessary?
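
For reference, the polling alternative described above can be sketched in plain POSIX C. This is only an illustration under stated assumptions, not lightdm code: `stop_child`, `spawn_demo_child` and `demo_stop` are hypothetical names, and the sketch uses `waitpid()` with `WNOHANG` only, so it reaps the child as soon as it sees the exit. `WNOWAIT` is left out because not every platform accepts it in `waitpid()`.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper, not lightdm code: send SIGTERM, then check the child
 * every interval_ms milliseconds, sleeping at most `polls` times.
 * Returns 0 if the child exited within the grace period, 1 if SIGKILL had
 * to be sent, and -1 on error.
 */
static int stop_child(pid_t pid, int polls, int interval_ms, int *status)
{
    struct timespec delay = {
        .tv_sec = interval_ms / 1000,
        .tv_nsec = (interval_ms % 1000) * 1000000L
    };

    assert(pid > 0 && polls > 0 && interval_ms > 0 && status != NULL);
    if (kill(pid, SIGTERM) == -1)
        return -1;

    for (int i = 0; ; i++) {
        /* WNOHANG: waitpid() returns 0 while the child is still running. */
        pid_t r = waitpid(pid, status, WNOHANG);
        if (r == pid)
            return 0;            /* exited; stop looping immediately */
        if (r == -1)
            return -1;
        if (i == polls)
            break;               /* grace period used up */
        nanosleep(&delay, NULL);
    }

    /* Only reached when the whole grace period elapsed. */
    if (kill(pid, SIGKILL) == -1)
        return -1;
    return waitpid(pid, status, 0) == pid ? 1 : -1;
}

/* Demo only: fork a child that idles, optionally ignoring SIGTERM. */
static pid_t spawn_demo_child(int ignore_term)
{
    int fds[2];
    char ready = 0;
    ssize_t n;
    pid_t pid;

    if (pipe(fds) == -1)
        return -1;
    pid = fork();
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        close(fds[0]);
        if (ignore_term)
            signal(SIGTERM, SIG_IGN);
        /* Tell the parent that the signal disposition is in place. */
        if (write(fds[1], "r", 1) != 1)
            _exit(1);
        close(fds[1]);
        for (;;)
            pause();
    }
    close(fds[1]);
    n = read(fds[0], &ready, 1);
    close(fds[0]);
    if (n != 1) {
        kill(pid, SIGKILL);
        waitpid(pid, NULL, 0);
        return -1;
    }
    return pid;
}

/* Demo only: 10 sleeps of 50 ms, so the grace period is about half a second. */
static int demo_stop(int ignore_term)
{
    int status = 0;
    pid_t pid = spawn_demo_child(ignore_term);

    if (pid == -1)
        return -1;
    return stop_child(pid, 10, 50, &status);
}
```

With the numbers from this comment, the call would look like `stop_child(pid, 12, 5000, &status)`: a 60-second grace period, checked every 5 seconds, with SIGKILL sent only if the loop runs to completion.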

Actions #7

Updated by Alexander Pyhalov over 2 years ago

Will a longer timeout effectively make lightdm wait for the full timeout?

Actions #8

Updated by Gary Mills over 2 years ago

No, a properly designed timeout loop would stop looping as soon as it detected that the process had terminated.

Did you want me to develop one?

Actions #9

Updated by Alexander Pyhalov over 2 years ago

No, I mean: if we just increase the timeout, will lightdm actually wait for the full timeout duration, or will the callback added by g_timeout_add() simply not fire?

Actions #10

Updated by Alexander Pyhalov over 2 years ago

Alexander Pyhalov wrote:

No, I mean: if we just increase the timeout, will lightdm actually wait for the full timeout duration, or will the callback added by g_timeout_add() simply not fire?

What happens if you just replace
priv->quit_timeout = g_timeout_add (5000, (GSourceFunc) quit_timeout_cb, process);
with
priv->quit_timeout = g_timeout_add (60000, (GSourceFunc) quit_timeout_cb, process);
?

Actions #11

Updated by Gary Mills over 2 years ago

Ah, just changing the timeout will work. The function process_stop() stores the ID returned by g_timeout_add() into a private memory area. The function process_watch_cb() calls g_source_remove() which removes the ID and callback function from the queue. So, as long as the interval has not elapsed and the callback function has not run, it will never run.

Making the timeout longer is a good solution. I'll develop another patch. It'll be even simpler.
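
The cancel-before-fire behavior described in this comment can be illustrated with a toy model. The sketch below is not GLib and not lightdm code: `timer_add`, `timer_remove`, `timers_dispatch` and `would_kill` are hypothetical stand-ins that only mimic the roles of g_timeout_add(), g_source_remove() and the main loop. It assumes a one-shot timer and treats time as a plain millisecond counter.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a main-loop timer queue. NOT GLib; names are hypothetical. */
typedef void (*timer_cb)(void *data);

struct timer {
    unsigned id;      /* 0 means the slot is free */
    long due_ms;      /* absolute time at which the callback should run */
    timer_cb cb;
    void *data;
};

#define MAX_TIMERS 8
static struct timer timers[MAX_TIMERS];
static unsigned next_id = 1;

/* Plays the role of g_timeout_add(): queue a callback, return its ID. */
static unsigned timer_add(long now_ms, long interval_ms, timer_cb cb, void *data)
{
    assert(cb != NULL && interval_ms >= 0);
    for (size_t i = 0; i < MAX_TIMERS; i++) {
        if (timers[i].id == 0) {
            timers[i] = (struct timer){ next_id++, now_ms + interval_ms, cb, data };
            return timers[i].id;
        }
    }
    return 0;  /* queue full */
}

/* Plays the role of g_source_remove(): drop the ID and its callback. */
static int timer_remove(unsigned id)
{
    for (size_t i = 0; i < MAX_TIMERS; i++) {
        if (id != 0 && timers[i].id == id) {
            timers[i].id = 0;
            return 1;
        }
    }
    return 0;  /* already fired or never queued */
}

/* Plays the role of the main loop: run every callback that is due. */
static void timers_dispatch(long now_ms)
{
    for (size_t i = 0; i < MAX_TIMERS; i++) {
        if (timers[i].id != 0 && timers[i].due_ms <= now_ms) {
            timer_cb cb = timers[i].cb;
            void *data = timers[i].data;
            timers[i].id = 0;  /* one-shot: free the slot before calling */
            cb(data);
        }
    }
}

/* Stand-in for quit_timeout_cb: records that SIGKILL would be sent. */
static void mark_killed(void *data)
{
    *(int *)data = 1;
}

/*
 * Returns 1 if the kill callback ran before the server exited at exit_ms,
 * and 0 if the exit handler removed the timer first.
 */
static int would_kill(long timeout_ms, long exit_ms)
{
    int killed = 0;
    unsigned id = timer_add(0, timeout_ms, mark_killed, &killed);

    assert(id != 0);
    timers_dispatch(exit_ms);               /* time advances to the exit */
    timer_remove(id);                       /* exit handler cancels the timer */
    timers_dispatch(timeout_ms + exit_ms);  /* later: nothing left to fire */
    return killed;
}
```

Plugging in the figures from this thread, `would_kill(5000, 7000)` returns 1 (a 5-second timer fires before a 7-second Xorg shutdown), while `would_kill(60000, 7000)` returns 0, because the timer is removed at exit and never runs.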

Actions #12

Updated by Gary Mills over 2 years ago

I have a better patch now, one that's even simpler. It's still called 30-xserver-term.patch, and needs to be installed in the patches directory of the OI lightdm source. I'll attach a copy of this revised patch.

I've tested it on all three of the systems that I mentioned originally. The two AMD Ryzen systems take 6 or 7 seconds for Xorg to terminate on receipt of signal 15. On the Intel system, the same operation takes only about 2 seconds. The potential 60 second delay before signal 9 is sent to Xorg should be adequate for any system now. It will only be sent if Xorg refuses to exit on receipt of signal 15. The original 5 second delay was too short for the two Ryzen systems.

Actions #13

Updated by Alexander Pyhalov over 2 years ago

  • Status changed from New to Resolved
  • Assignee changed from OI XNV to Gary Mills