Bug #15328
openFPU_CW_INIT mistreats reserved bit
0%
Description
As part of testing for unrelated FPU bits, it was found that bit 6 of the FPU Control Word (fcw
) was being cleared in some situations. This could be seen by walking the xsave cache:
> ::walk xsave_cache | ::print struct fxsave_state fx_fcw ! sort | uniq -c 305 fx_fcw = 0 126 fx_fcw = 0x133f 65 fx_fcw = 0x137f
The entries with fx_fcw
are strange here, since they are clearly not empty, but are also missing bit 6. The Intel SDM (and AMD's equivalent) both list that bit as Reserved. One might jump to the conclusion that it must therefore be unset, but the default value(s) for the fcw
tell a different story:
When the x87 FPU is initialized with either an FINIT/FNINIT or FSAVE/FNSAVE instruction, the x87 FPU control
word is set to 037FH...
The table for the power-up state (and post-reset) state for that register is 0x0040
, with the bit also set.
If the bit is Reserved as set, then we should be preserving that value when manipulating the FPU or its save states. Some cursory investigation as to why the bit may have been cleared leads us to an old initialization constant:
uts/intel/sys/fp.h
:
/*
* Initial value of FPU control word as per 4th ed. ABI document
* - affine infinity
* - round to nearest or even
* - 64-bit double precision
* - all exceptions masked
*/
#define FPU_CW_INIT 0x133f
This constant (and one site where that same raw value is used) forms the basis for how we've inappropriately been clearing that bit in FPU save states.
While it is awfully nice that neither Intel nor AMD CPUs have thrown an error in the face of our fumbling of this reserved bit, we should probably fix it.
Updated by Patrick Mooney 21 days ago
As part of the research for this, especially in the confusion around bit 6 (and how its 1-value is not documented in a particularly verbose manner), it was noted that the bochs i386 emulator explicitly keeps that bit set when the fcw
is manipulated. Like Intel and AMD, it just asserts the bit if missing, but makes no fuss about it.
Updated by Robert Mustacchi 21 days ago
To give some additional context here, I was working on #15254 and test cases. One of them basically dumped had a program set the fpu registers and get them when we hit a breakpoint we'd dump them via mdb. One notable thing about the test was that it had both a 32-bit and 64-bit variant and then we'd dump the xregs twice, once with ::tmodel lwp
and ::tmodel thread
set. At first, this provided consistent results. However, eventually I began to see the following maybe 5% of the time:
rm@iliad /tmp $ diff -u xsave/data/mdb_xregs.ymm.64 mdb_xregs.102372/fpregs.ymm.64.lwp --- xsave/data/mdb_xregs.ymm.64 Tue Jan 17 17:32:53 2023 +++ mdb_xregs.102372/fpregs.ymm.64.lwp Tue Jan 17 19:57:51 2023 @@ -53,7 +53,7 @@ %xmm15 [127:0] 0x119004d1 119004d0 119004cf 119004ce 387 and FP Control State -cw 0x137f (IM|DM|ZM|OM|UM|PM|SIG64|RTN|A) +cw 0x133f (IM|DM|ZM|OM|UM|PM|SIG64|RTN|A) sw 0x0000 (TOP=0t0) (0) xcp sw 0x0000 (0)
When I saw this it was disconcerting as I began to ask myself why was it changing and what was going on. I suspected the xregs
implementation here. However, it was weird. It only seemed to happen for 64-bit processes and not 32-bit processes in the test I ran (which only had a single victim I wrote). From here I went through and started questioning a lot of different things and had a lot of different DTrace pieces that were all a lot more than one liners. The question I was trying to answer at first was when does this change. While I don't have the output from it, here are some of the different pieces:
pfexec dtrace -n 'fbt::fpu_proc_xregs_xsave_fill:entry/execname == "mdb" && strstr(curpsinof->pr_args, "64") != NULL/{ trace(curpsinfo->pr_args); }' pfexec dtrace -n 'fbt::fpu_proc_xregs_xsave_fill:entry/execname == "mdb" && strstr(curpsinfo->pr_psargs, "64") != NULL/{ print(args[0]->fpu_regs.kfpu_u.kfpu_fx->fx_fcw); }' -n 'fbt::fpsetcw:entry/execname == "xregs_set.64"/{ printf("0x%x\n", arg0); }' -n 'fbt::fpu_signal_copyout:entry/execname == "xregs_set.64"/{ self->t = args[0]; }' -n 'fbt::fpu_signal_copyout:return/self->t/{ print(args[0]->fpu_regs.kfpu_u.kfpu_fx->fx_fcw); self->t = NULL; }' -n 'fbt::fpu_signal_copyin_xmm:entry/execname == "xregs_set.64"/{ print(args[1]->uc_mcontext.fpregs->fp_reg_set.fpchip_state.cw); }'
A particularly desperate case was when i started enabling all the syscalls to track when this would first change. There were a few that seemed deceptively plausible:
$ pfexec dtrace -n 'fbt::fpsetcw:entry/execname == "xregs_set.64"/{ self->t = 1; }' -n 'fbt::fp_save:entry/self->t/{ fpu = args[0]; self->fpu = args[0]; print(self->fpu->fpu_regs.kfpu_u.kfpu_fx->fx_fcw); }' -n 'fbt::fp_save:return/self->fpu/{ print(self->fpu->fpu_regs.kfpu_u.kfpu_fx->fx_fcw); }' -n 'fbt::fpsetcw:return/self->t/{ print(self->fpu->fpu_regs.kfpu_u.kfpu_fx->fx_fcw); self->sys = 1; }' -n 'syscall:::/fpu != NULL && fpu->fpu_regs.kfpu_u.kfpu_fx->fx_fcw == 0x137f/{ stack(); ustack(); trace(execname); exit(0); }' dtrace: description 'fbt::fpsetcw:entry' matched 1 probe dtrace: description 'fbt::fp_save:entry' matched 1 probe dtrace: description 'fbt::fp_save:return' matched 1 probe dtrace: description 'fbt::fpsetcw:return' matched 1 probe dtrace: description 'syscall:::' matched 468 probes CPU ID FUNCTION:NAME 0 7761 fp_save:entry uint16_t 0x137f 0 7762 fp_save:return uint16_t 0x137f 0 9528 fpsetcw:return uint16_t 0x133f 56 335 pread:return unix`sys_syscall+0x283 libc.so.1`__pread+0xa libproc.so.1`Pstopstatus+0x3c4 libproc.so.1`Pwait+0x12 mdb`pt_setrun+0x16a mdb`pt_continue+0xd mdb`tgt_continue+0x51e mdb`tgt_request_continue+0x5b mdb`mdb_tgt_continue+0x1b mdb`cmd_cont_common+0x93 mdb`cmd_cont+0x19 mdb`cmd_run+0x3b mdb`dcmd_invoke+0x60 mdb`mdb_call_idcmd+0x116 mdb`mdb_call+0x3a1 mdb`yyparse+0x76d mdb`mdb_run+0x308 mdb`main+0xc11 mdb`_start_crt+0x87 mdb`_start+0x18 mdb
But ultimately a bunch of these weren't leading anywhere as when I started getting other things that saw this like sshd
. Looking back on it now, I suspect some of it.
Eventually I started taking a different tact. While I had realized this was a reserved bit (and searched everywhere I could to figure out what it might have meant, but I could not find a historical meaning unlike other reserved bits), I critically misread the CPU default of 37fh
as 37h
. So the next place I began going was to step through a bunch of the program to try and see if I could pinpoint when it changed. At various points I thought I had managed to get DTrace to inject enough noise that it didn't occur, but that seems not to have been accurate.
One of the more interesting discoveries was that this was set at the start of the process -- not at main
, but in the CRT startup. This is part of what prompted the mdb -k
bit that Patrick included above and I had started looking at on all manners of systems (AMD physical systems, Intel based VMs wih AVX, etc.). I stepped through at different parts looking at a combination of ::fpregs
, using mdb -k and a ::pgrep xregs | ::walk thread | ::print kthread_t t_lwp->lwp_pcb.pcb_fpu.fpu_regs.kfpu_u.kfpu_fx->fx_fcw
. It was not adding up. So the CRT changing to what we expected of 133f
in main
made some sense because of what fpstart
did. However, I found that this was quickly reverted the moment we stepped over and resolved something in the PLT. I stepped through to an old friend from #9595 and found myself back in the amd64 boot_elf.s
code for elf_rtbndr()
. It's worth recalling that after #9595, we actually do an xsave
and xrstor
there! And it was at that xrstor
(and due to single stepping) that I began to see the kernel save sate reflect the 0x137f
value.
Now, this started leading to a lot more questions than answers (again because I misread the hardware defaults). The FPU control word actually weirdly will get inherited by a new LWP even though the initial state has it set to 0x133f
. I took a VM and set a breakpoint on fp_new_lwp
setting a breakpoint to see when the first time the copied state would have 0x137f
set. Because of the earlier observation I had mistakenly believed this was 64-bit specific, but had never actually characterized the processes that do and don't have this set. Lo and behold, this turned up that the second time we hit this, it was a problem. And what process did that mean had this bad value set, but init!!
Thankfully, Patrick was able to put together the bit that I had been missing all this time: that we were actually clearing this reserved bit that defaulted to 1! That breakthrough was the missing link that I needed as all of the other debugging while a possibly interesting journey was not going anywhere good.