Bug #7637

restorecontext(ucontext_t *ucp) leaves all maskable signals blocked in curthread->t_hold

Added by Richard Reingruber over 3 years ago. Updated 3 months ago.

Status: Closed
Priority: Normal
Category: kernel
Start date: 2016-12-02
Due date:
% Done: 100%
Estimated time:
Difficulty: Medium
Tags: needs-triage
Gerrit CR:

Description

Symptom: process gets killed upon SIGSEGV even though a handler is registered.

DTrace analysis: after returning from a signal handler, restorecontext(ucontext_t *ucp) leaves all maskable signals blocked in curthread->t_hold. The next SIGSEGV therefore finds the signal blocked, and the process is killed instead of the handler being run.
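The user-visible side of this symptom can be checked with a small diagnostic. The sketch below is not the attached test case and does not trigger the race; it only shows one way to ask, after a handler has returned, whether SIGSEGV is still in the calling thread's signal mask. The function name sigsegv_blocked_after_handler, the helper on_signal, and the choice of SIGUSR1 as the probe signal are assumptions made for illustration. It also assumes SIGUSR1 and SIGSEGV are not blocked when the function is called.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Set by the handler so the caller can confirm it really ran. */
static volatile sig_atomic_t handler_ran = 0;

/* Hypothetical probe handler: does nothing except record that it ran. */
static void
on_signal(int signo)
{
	(void)signo;
	handler_ran = 1;
}

/*
 * Deliver SIGUSR1 to ourselves, let the handler return, then query the
 * current signal mask without changing it.
 * Returns 1 if SIGSEGV is blocked afterwards, 0 if it is not,
 * and -1 if the probe itself could not be carried out.
 */
int
sigsegv_blocked_after_handler(void)
{
	struct sigaction sa;
	sigset_t cur;

	memset(&sa, 0, sizeof (sa));
	sa.sa_handler = on_signal;
	sigemptyset(&sa.sa_mask);	/* block nothing extra in the handler */
	if (sigaction(SIGUSR1, &sa, NULL) != 0)
		return (-1);

	handler_ran = 0;
	if (raise(SIGUSR1) != 0 || !handler_ran)
		return (-1);

	/* A NULL "set" argument only reads the mask into cur. */
	if (sigprocmask(SIG_BLOCK, NULL, &cur) != 0)
		return (-1);

	return (sigismember(&cur, SIGSEGV));
}
```

On a system without the bug the function is expected to return 0. A return value of 1 would match the state described above, although the race itself still needs the multi-threaded setup from the attached test to occur.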

I tracked down this bug in Solaris 11, where it caused my server process to crash on very rare occasions. I used the Illumos/OpenIndiana sources and DTrace to do the analysis. That was a while ago, and Oracle has already fixed the bug in Solaris 11. Only today did I run my test on an OpenIndiana Hipster 2016.10 live system on VirtualBox, and I found that the bug is still present there.

The issue can be reproduced very easily with the small test attached.

A detailed description is included in solaris11_restorecontext_bug.c

Important: to trigger the bug, the test must run with 2 threads on at least 2 physical CPUs.


Files

solaris11_restorecontext_bug_gensigsegv_amd64.s (9.02 KB) - SIGSEGV generating assembler routine (x86) - Richard Reingruber, 2016-12-02 02:14 PM
solaris11_restorecontext_bug_gensigsegv_sparc.s (9.04 KB) - SIGSEGV generating assembler routine (sparc) - Richard Reingruber, 2016-12-02 02:14 PM
README.txt (75 Bytes) - Richard Reingruber, 2016-12-02 02:14 PM
build.sh (330 Bytes) - Richard Reingruber, 2016-12-02 02:14 PM
solaris11_restorecontext_bug.d (1.51 KB) - dtrace script - Richard Reingruber, 2016-12-02 02:14 PM
solaris11_restorecontext_bug.c (6.53 KB) - test sources - Richard Reingruber, 2016-12-02 02:14 PM
solaris11_restorecontext_bug (15 KB) - x86 executable generated on Solaris 11 with Sun Studio 11 - Richard Reingruber, 2016-12-02 02:14 PM
dtrace.zip (15.6 KB) - dtrace output - Richard Reingruber, 2016-12-02 02:14 PM
joyent-test.tar (7.5 KB) - Smaller test case from Joyent OS-5834 - Jason King, 2020-04-04 11:40 PM

History

#1

Updated by Robert Mustacchi over 3 years ago

Thank you for reporting this and providing the detailed analysis and reproduction code. We'll dig into this.

#2

Updated by Patrick Mooney over 3 years ago

After tracing this out for a while, I believe I've found the crux of the issue. The comment for schedctl_finish_sigblock notes that it should either be called by curthread or, if in an "external" thread, while holding p_lock. While the p_lock requirement does synchronize external callers, it does not address the possibility that signal-related code in curthread will race when updating t_hold. The test program achieves this by hammering schedctl_finish_sigblock via the procfs facilities for querying LWP status.

#3

Updated by Patrick Mooney over 3 years ago

  • Status changed from New to In Progress
  • Assignee set to Patrick Mooney

#4

Updated by Patrick Mooney over 3 years ago

Opened as OS-5834 for the purpose of burn-in downstream before integration into the gate.

#5

Updated by Patrick Mooney over 3 years ago

Merged downstream as 060157c

#6

Updated by Jason King 3 months ago

Attaching the smaller test case from OS-5834

#7

Updated by Jason King 3 months ago

From Joyent OS-5834:

After tracing this out for a while, I believe I've found the crux of the issue. The comment for schedctl_finish_sigblock notes that it should either be called by curthread or, if in an "external" thread, while holding p_lock. While the p_lock requirement does synchronize external callers, it does not address the possibility that signal-related code in curthread will race when updating t_hold. The test program achieves this by hammering schedctl_finish_sigblock via the procfs facilities for querying LWP status.

#8

Updated by Jason King 3 months ago

I re-ran the Joyent test case. On a PI without the fix, it crashes as expected; with the fix in place, the test case no longer crashes.

#9

Updated by Electric Monk 3 months ago

  • Status changed from In Progress to Closed
  • % Done changed from 0 to 100

git commit 72a6dc127431d372b6b6136087c736300544f8b7

commit  72a6dc127431d372b6b6136087c736300544f8b7
Author: Patrick Mooney <pmooney@pfmooney.com>
Date:   2020-04-09T00:41:51.000Z

    7637 restorecontext(ucontext_t *ucp) leaves all maskable signals blocked in curthread->t_hold
    Reviewed by: Robert Mustacchi <rm@joyent.com>
    Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
    Reviewed by: Toomas Soome <tsoome@me.com>
    Approved by: Dan McDonald <danmcd@joyent.com>
