Bug #7637
Closed
restorecontext(ucontext_t *ucp) leaves all maskable signals blocked in curthread->t_hold
100%
Description
Symptom: process gets killed upon SIGSEGV even though a handler is registered.
DTrace analysis: after returning from a signal handler, restorecontext(ucontext_t *ucp) leaves all maskable signals blocked in curthread->t_hold. So upon the next SIGSEGV the process gets killed.
I tracked down this bug in Solaris 11, where it caused my server process to crash on very rare occasions. I used the Illumos/OpenIndiana sources and DTrace to do the analysis. That was a while ago, and Oracle has already fixed the bug in Solaris 11, but only today did I run my test on an OpenIndiana Hipster 2016.10 live system on VirtualBox and find that the bug is still present there.
The issue can be reproduced very easily with the small test attached.
A detailed description is included in solaris11_restorecontext_bug.c
Important: to trigger the bug, the test must run with 2 threads on at least 2 physical CPUs.
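The invariant that the symptom violates can be checked from user space. The following is a minimal single-threaded sketch, not the attached test: it delivers one signal through a handler and compares how many signals are blocked before and after the handler returns. The helper names blocked_signal_count and mask_survives_handler are hypothetical, and the 1..64 signal range is an assumption chosen to cover common platforms. On its own it is not expected to trigger the bug, since the description above says two threads on two CPUs are required.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Set by the handler so the caller can confirm it actually ran. */
static volatile sig_atomic_t handler_ran = 0;

static void on_signal(int sig)
{
    (void)sig;
    handler_ran = 1;
}

/*
 * Count the signals currently blocked for the caller.
 * Signal numbers 1..64 are probed (an assumption); sigismember()
 * returns -1 for numbers the platform does not support, and those
 * are simply skipped. Returns -1 if the mask cannot be queried.
 */
int blocked_signal_count(void)
{
    sigset_t cur;
    int n = 0;

    /* With a NULL new set, sigprocmask only reports the current mask. */
    if (sigprocmask(SIG_BLOCK, NULL, &cur) != 0)
        return -1;
    for (int sig = 1; sig <= 64; sig++) {
        if (sigismember(&cur, sig) == 1)
            n++;
    }
    return n;
}

/*
 * Deliver `sig` to the caller through a handler and report whether the
 * handler ran and the blocked-signal count is unchanged afterwards.
 * Returns 1 when both hold, 0 otherwise.
 */
int mask_survives_handler(int sig)
{
    struct sigaction sa;
    int rc, before, after;

    memset(&sa, 0, sizeof (sa));
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);
    rc = sigaction(sig, &sa, NULL);
    assert(rc == 0);
    (void)rc;

    before = blocked_signal_count();
    handler_ran = 0;
    raise(sig);            /* returns after the handler has returned */
    after = blocked_signal_count();

    return handler_ran == 1 && before == after;
}
```

Calling mask_survives_handler(SIGUSR1) should return 1 on a healthy system. In the failure described above, the count after the handler returns would instead jump to cover every maskable signal.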
Files
Updated by Robert Mustacchi over 6 years ago
Thank you for reporting this and providing the detailed analysis and reproduction code. We'll dig into this.
Updated by Patrick Mooney over 6 years ago
After tracing this out for a while, I believe I've found the crux of the issue. The comment for schedctl_finish_sigblock notes that it should either be called by curthread or, if in an "external" thread, while holding p_lock. While the p_lock requirement does synchronize external callers, it does not address the possibility that signal-related code in curthread will race when updating t_hold. The test program achieves this by hammering schedctl_finish_sigblock via the procfs facilities for querying LWP status.
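One interleaving consistent with that description can be modelled in user space. This is a simplified sketch under stated assumptions, not kernel code and not the eventual fix: struct model_thread, external_check, external_apply, owner_restore, run_racy, run_serialized and ALL_MASKABLE are all hypothetical stand-ins, and the external caller's work is split into a "check" and an "apply" step so the race window can be replayed deterministically without real threads.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for "every maskable signal is blocked" (assumed encoding). */
#define ALL_MASKABLE 0xFFFFFFFFu

/* Hypothetical model of the per-thread state involved in the race. */
struct model_thread {
    unsigned hold;   /* models curthread->t_hold */
    bool sigblock;   /* models a pending "block everything" request */
};

/* External caller, step 1: observe whether a block request is pending. */
static bool external_check(const struct model_thread *t)
{
    return t->sigblock;
}

/* External caller, step 2: act on what step 1 observed. */
static void external_apply(struct model_thread *t, bool seen)
{
    if (seen) {
        t->hold = ALL_MASKABLE;
        t->sigblock = false;
    }
}

/* Owner thread: restore the saved mask without any lock. */
static void owner_restore(struct model_thread *t, unsigned saved)
{
    t->sigblock = false;
    t->hold = saved;
}

/* Owner's restore lands between the external check and apply steps. */
unsigned run_racy(unsigned saved)
{
    struct model_thread t = { .hold = 0, .sigblock = true };
    bool seen = external_check(&t);

    owner_restore(&t, saved);
    external_apply(&t, seen);   /* acts on a stale observation */

    /* The request flag is clear, so nothing will recompute the mask. */
    assert(!t.sigblock);
    return t.hold;
}

/* Same steps, but the external caller finishes before the owner runs. */
unsigned run_serialized(unsigned saved)
{
    struct model_thread t = { .hold = 0, .sigblock = true };

    external_apply(&t, external_check(&t));
    owner_restore(&t, saved);
    return t.hold;
}
```

In this model run_racy(0) ends with every maskable signal blocked even though the owner restored an empty mask, while run_serialized(0) ends with the restored mask. A lock taken only by external callers does not prevent the first ordering, because the owner's update happens outside it.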
Updated by Patrick Mooney over 6 years ago
- Status changed from New to In Progress
- Assignee set to Patrick Mooney
Updated by Patrick Mooney over 6 years ago
Opened as OS-5834 for the purpose of burn-in downstream before integration into the gate.
Updated by Jason King almost 3 years ago
- File joyent-test.tar added
Attaching the smaller test case from OS-5834
Updated by Jason King almost 3 years ago
From Joyent OS-5834:
After tracing this out for a while, I believe I've found the crux of the issue. The comment for schedctl_finish_sigblock notes that it should either be called by curthread or, if in an "external" thread, while holding p_lock. While the p_lock requirement does synchronize external callers, it does not address the possibility that signal-related code in curthread will race when updating t_hold. The test program achieves this by hammering schedctl_finish_sigblock via the procfs facilities for querying LWP status.
Updated by Jason King almost 3 years ago
I re-ran the Joyent test case. On a PI without the fix, it crashes as expected; with the fix in place, the test case no longer crashes.
Updated by Electric Monk almost 3 years ago
- Status changed from In Progress to Closed
- % Done changed from 0 to 100
git commit 72a6dc127431d372b6b6136087c736300544f8b7
commit 72a6dc127431d372b6b6136087c736300544f8b7
Author: Patrick Mooney <pmooney@pfmooney.com>
Date: 2020-04-09T00:41:51.000Z

7637 restorecontext(ucontext_t *ucp) leaves all maskable signals blocked in curthread->t_hold
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Reviewed by: Toomas Soome <tsoome@me.com>
Approved by: Dan McDonald <danmcd@joyent.com>