Bug #15254
open%ymm registers not restored after signal handler
0%
Description
update: some of the instances previously attributed to this bug are likely caused by https://www.illumos.org/issues/15367 instead.
(The salient details and test program are mirrored at https://github.com/davepacheco/illumos-ymm .)
While debugging an issue with the Go runtime and CockroachDB, I began to suspect that there might be a problem with the handling of %ymm registers after a signal handler was run. Robert confirmed this seemed likely so I wrote a test program to verify it. I apologize in advance that this could probably be a lot simpler, but I wanted to be sure it was clear and correct, so there's a lot of extra code to read and print %ymm0.
Here's main.c:
/* * test program to examine the behavior of preserving ymm registers across * signal handlers */ #include <errno.h> #include <stdio.h> #include <unistd.h> #include <signal.h> #include <strings.h> // set by the signal handler to wake up main() static volatile int gotsig; // signal handler static void handle_sig(int, siginfo_t *, void *); // functions defined in asm.s void read_ymm0(uint64_t val[4]); void write_ymm0(uint64_t val[4]); int main(int argc, char *argv[]) { uint64_t initial_value[4] = { 1, 2, 3, 4 }; uint64_t test_value[4]; struct sigaction sa_sig; // Register signal handler that will clobber %ymm0. // SIGINT is more convenient for interactive use. // SIGUSR1 is more convenient when run under mdb. bzero(&sa_sig, sizeof (sa_sig)); sa_sig.sa_sigaction = handle_sig; sa_sig.sa_flags = SA_SIGINFO; (void) sigaction(SIGINT, &sa_sig, NULL); (void) sigaction(SIGUSR1, &sa_sig, NULL); (void) sigaction(SIGUSR2, &sa_sig, NULL); // Write a known initial value to %ymm0. write_ymm0(&initial_value[0]); // Read it back and print it out. bzero the test_value first so that we // know we've actually successfully read the right thing. bzero(test_value, sizeof (test_value)); (void) printf( "main: test value (expect zeros): 0x%lx 0x%lx 0x%lx 0x%lx\n", test_value[0], test_value[1], test_value[2], test_value[3]); read_ymm0(&test_value[0]); (void) printf("main: %%ymm0: 0x%lx 0x%lx 0x%lx 0x%lx\n", test_value[0], test_value[1], test_value[2], test_value[3]); // Wait patiently for a signal to arrive. printf("main: waiting for SIGINT, SIGUSR1, or SIGUSR2\n"); for (;;) { while (gotsig == 0) { uint8_t c; int r = read(0, &c, 1); if (r == 0) { // Stop on end-of-input. goto done; } if (r == -1 && errno != EINTR) { // Bail on any unexpected error. perror("read"); goto done; } } // Print what we found in %ymm0. (void) printf("main: saw signal\n"); read_ymm0(&test_value[0]); (void) printf("main: %%ymm0: 0x%lx 0x%lx 0x%lx 0x%lx\n", test_value[0], test_value[1], test_value[2], test_value[3]); // Reset %ymm0 so that we can try again. (void) printf("main: resetting ymm0\n"); write_ymm0(&initial_value[0]); read_ymm0(&test_value[0]); (void) printf("main: %%ymm0: 0x%lx 0x%lx 0x%lx 0x%lx\n", test_value[0], test_value[1], test_value[2], test_value[3]); gotsig = 0; } done: return (0); } // Handles SIGINT, SIGUSR1, or SIGUSR2. // On SIGINT or SIGUSR1, clobbers %ymm0. Otherwise, does basically nothing. static void handle_sig(int sig, siginfo_t *sip, void *arg) { char buf[256]; int nbytes; int doclobber; uint64_t found_value[4]; // Read %ymm0 and print what we found. read_ymm0(&found_value[0]); doclobber = sig == SIGINT || sig == SIGUSR1; nbytes = snprintf(buf, sizeof (buf), "\nhandle_sig: found %%ymm0 = 0x%lx 0x%lx 0x%lx 0x%lx (%s)\n", found_value[0], found_value[1], found_value[2], found_value[3], doclobber ? "CLOBBER" : "NO clobber"); (void) write(STDOUT_FILENO, buf, nbytes); // If this was SIGINT or SIGUSR1, clobber %ymm0. Print it out again so // we know we got it right. if (doclobber) { uint64_t clobber_value[4] = { 5, 6, 7, 8 }; write_ymm0(&clobber_value[0]); read_ymm0(&found_value[0]); nbytes = snprintf(buf, sizeof (buf), " now %%ymm0 = 0x%lx 0x%lx 0x%lx 0x%lx\n", found_value[0], found_value[1], found_value[2], found_value[3]); (void) write(STDOUT_FILENO, buf, nbytes); } // Tell main() to proceed. gotsig = sig; }
Here's an asm file with functions I'm using to read and write directly to %ymm0:
.text .globl read_ymm0 .type read_ymm0, @function read_ymm0: pushq %rbp movq %rsp, %rbp vmovdqu %ymm0,(%rdi) leave ret .globl write_ymm0 .type write_ymm0, @function write_ymm0: pushq %rbp movq %rsp, %rbp vlddqu (%rdi),%ymm0 leave ret
Build with:
$ gcc -march=native -m64 -Wall -o checkregs main.c asm.s
Start it up:
$ gcc -march=native -m64 -Wall -o checkregs main.c asm.s && ./checkregs main: test value (expect zeros): 0x0 0x0 0x0 0x0 main: %ymm0: 0x1 0x2 0x3 0x4 main: waiting for SIGINT, SIGUSR1, or SIGUSR2
Now hit it with SIGINT:
^C handle_sig: found %ymm0 = 0xff0000000000 0x0 0x3 0x4 (CLOBBER) now %ymm0 = 0x5 0x6 0x7 0x8 main: saw signal main: %ymm0: 0xff000000000000 0x0 0x3 0x4 main: resetting ymm0 main: %ymm0: 0x1 0x2 0x3 0x4
In that case, `main` found a different %ymm0 than it had before. The low 128
bits are the same, but the high 128 bits are different. Then it reset %ymm0 so
we can try it again:
^C handle_sig: found %ymm0 = 0x1 0x2 0x3 0x4 (CLOBBER) now %ymm0 = 0x5 0x6 0x7 0x8 main: saw signal main: %ymm0: 0xff000000000000 0x0 0x3 0x4 main: resetting ymm0 main: %ymm0: 0x1 0x2 0x3 0x4
Same story, though this time the signal handler at least found the value that
`main()` had left.
Now let's do it with SIGUSR2, whose handler doesn't write to %ymm0:
handle_sig: found %ymm0 = 0x1 0x2 0x3 0x4 (NO clobber) main: saw signal main: %ymm0: 0xff000000000000 0x0 0x3 0x4 main: resetting ymm0 main: %ymm0: 0x1 0x2 0x3 0x4
Same story. Again:
handle_sig: found %ymm0 = 0x1 0x2 0x3 0x4 (NO clobber) main: saw signal main: %ymm0: 0xff000000000000 0x0 0x3 0x4 main: resetting ymm0 main: %ymm0: 0x1 0x2 0x3 0x4
Same story.
Related issues
Updated by Robert Mustacchi 3 months ago
- Category set to kernel
- Difficulty changed from Medium to Hard
Updated by Robert Mustacchi 3 months ago
This is fundamentally a problem in the kernel's process of saving and restoring the floating point state in the kernel. Let's go into some detail about how this works on x86 and then what's going on. For additional background see uts/intel/os/fpu.c.
x86 has had various ways of saving and restoring the floating point registers over the years. When amd64 came out, it mandated support for the full XMM state which was implemented with the fxsave instruction. The 32-bit ABI was different and assumed different things over the years, we'll come back to that. For a while this was all good and there were no vector registers other than the 16 XMM registers available in 64-bit processes (already a large jump).
Eventually, Intel wanted to increase the vector size which came about with the AVX instruction set, which turned the 16 128-bit XMM registers into 16 256-bit YMM registers. AVX-512 extended this to 16 512-bit ZMM registers where the various execution states overlap one another. That is xmm0 overlaps with the low 128-bits of ymm0 and zmm0. This also occurs for ymm0. To save this state, Intel added the xsave instruction, which is what we use in the kernel for saving and restoring the FPU.
Unfortunately, there is a gap here. When a signal occurs, we currently generate a ucontext_t and save that to the stack of the interrupted thread. The ucontext_t on amd64 currently contains the traditional amd64 ABI fpregset_t which basically contains the fxsave data. This is later restored through a call to setcontext, which calls setfpregs(). This itself saves the current fpu state, updates the fpregset_t portion of it which only corresponds to a portion of the xsave state and restores it.
Let's get into a bit more detail here. The xsave state first starts with the traditional xsave state and then has a bunch of flags and other regions that follow that correspond to a large suite of either user or supervisor states that range from the ymm registers, the upper half zmm registers, various mask registers for AVX-512, etc. each of which has a corresponding flag to indicate that they are modified or not. Because the signal handler borrows our existing registers and we're not saving the upper bits ahead of the signal handler, we will end up with whatever the signal handler modified them with.
Put another way, if code that the signal handler executes uses any ymm/zmm state then we will corrupt what was previously there and end up in a bad place. So the long story short is we need to correctly save and restore the full xsave state while signal handling and in setcontext().
Doing that correctly and around the ABI is... nuanced. I will write up an IPD around this to further explain the specifics of how this works.
Updated by David Pacheco 3 months ago
In case it's useful, a more detailed analysis of how I got to this issue is in https://github.com/oxidecomputer/omicron/issues/1146.
As far as I'm aware, this issue would affect any Go programs at least in the range of v1.16 to v1.19. The effect is potentially substantial memory corruption because %ymm0 is used while zeroing various data structures, and Go uses signals extensively for preempting Goroutines. The problem seems to be reliably avoided by setting GODEBUG=asyncremptoff=1 (see https://pkg.go.dev/runtime#hdr-Environment_Variables), which disables this heavy use of signals. That doesn't eliminate the possibility of corruption but it appears to drastically reduce it. I think that's a reasonable workaround; as I understand it, the cost is potentially increased latency (since Goroutines are allowed to run longer). In principle, there are pathological cases (e.g., Goroutines trying to hog the CPU.)
Updated by David Pacheco 3 months ago
The test program above is not quite valid because it calls functions (like printf) in between writing %ymm0 and reading it back. Those functions are allowed to use %ymm0. I have updated https://github.com/davepacheco/illumos-ymm with a version that writes %ymm0 and then busy-waits until it changes. It shows the same problem. Thanks to Robert for pointing out this problem.
Updated by Robert Mustacchi 3 months ago
So there's an interesting nuance here that occurs and also explains partly why this hasn't been as bad an issue or hits us all the time.
When we're in setfpregs()
which is the function that has the implementation of setting the traditional fxsave state used in the signal handler and in setcontext()
more generally, we first must synchronize the state of the hardware FPU with what we are updating. When we're normally switching processes, we'll call fpsave()
which effectively is calling xsave
under the hood, while also setting TS and making sure that the state is marked as valid. Critically we aren't doing that here. Instead we call fp_free()
.
This has the effect of basically marking the FPU state valid and clearing any exceptions in the hardware. But critically it does not actually call xsave
or fxsave
. In addition, when we do this, we set the xsave header's component set by ORing in the x87 and SSE bits. This means that when we are updating state, we will update the hardware FPU to:
- The active contents of the fxsave region
- We will reset everything else to whatever was last saved
So in the original case here, what was last saved in the ymm portion would have been the 0x3/0x4, so it was getting restored by fluke. However, if we were to context switch off CPU, then we would update the FPU save state to whatever is actually inside it during that time. We do do a save when we take a signal / savecontext. So this means that this manifests when we take a signal, modify FPU state, and then go off CPU to commit the new version (or something else happens).
Updated by David Pacheco about 2 months ago
- Description updated (diff)
While testing the fix for this, we found that some of the problems previously believed to be caused by this issue were still happening with the fix in place. Robert filed https://www.illumos.org/issues/15367 describing a different (but closely related) issue.
Updated by Robert Mustacchi about 2 months ago
- Related to Bug #15367: x86 getfpregs() summons corrupting %xmm ghosts added