Bug #4285
closed
preemption leaves a context op behind
100%
Description
Ken Harris had long been chasing an FPU corruption issue on x86 systems, which he discussed on smartos-discuss. He was finally able to catch it red-handed and included the following information:
I think the problem is this:
freectx() loops through a linked list of ctxop pointers held by the lwp. It removes a given ctxop pointer from the linked list, then calls its free function. The problem is that if the free function is preempted after the ctxop has been removed from the linked list but before it carries out fpdisable(), then savectx() (invoked by resume()) will not call the fpu_save function, as it is no longer present in the ctxop linked list. Thus CR0.TS will be left clear leading into the resumed thread. Here's a stack trace demonstrating this:
vpanic()
0xfffffffffbaf8a3f() <<<<< This is savectx() with an assert CR0.TS is set just before it would return.
resume+0x60()
swtch+0x141()
preempt+0xec()
kpreempt+0x98(1)
sys_rtt_common+0x1ba(ffffff008db2ab40)
_sys_rtt_ints_disabled+8()
fp_free+0x17(ffffff1588388880, 1)
freectx+0x4e(ffffff1406c61780, 1)
exec_common+0x5ef(49f639, 49f550, 49f5b8, 0)
exece+0x1b(49f639, 49f550, 49f5b8)
sys_syscall+0x17a()
His analysis is spot on. Once the ctxop has been unlinked, savectx() can no longer see it, so a preemption in that window leaves CR0.TS clear, and the next thread that runs on that CPU and uses the FPU will get very unlucky. The solution is pretty straightforward: we need to disable preemption in this case and in related cases where we're removing something from the list. Specifically, we need to look at freectx, removectx, freectx_ctx, freepctx, and removepctx.
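The race and the proposed fix can be illustrated with a small userland model. This is only a sketch under stated assumptions, not the illumos source and not the committed change: every name prefixed with model_, the thread_model structure, the cr0_ts flag, and run_scenario() are hypothetical stand-ins, and the preempt_disabled counter merely imitates the effect of bracketing the removal with kpreempt_disable()/kpreempt_enable(). Where exactly the real fix places those calls is not shown in this issue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of a context op; field names are illustrative only. */
struct ctxop {
	void (*save_op)(void *);
	void (*free_op)(void *);
	void *arg;
	struct ctxop *next;
};

/* Toy model of a thread: its ctxop list and a preemption-disable count. */
struct thread_model {
	struct ctxop *ctx;
	int preempt_disabled;
};

static struct thread_model *cur;   /* thread currently "on CPU" */
static bool cr0_ts;                /* true: TS set, FPU disabled */
static bool preempt_deferred;      /* preemption requested while disabled */
static bool switched;              /* a context switch happened */
static bool ts_at_switch;          /* CR0.TS as seen by the next thread */

/* Stand-in for savectx() during a switch: run save ops still on the list. */
static void
model_switch(void)
{
	struct ctxop *c;

	for (c = cur->ctx; c != NULL; c = c->next)
		c->save_op(c->arg);
	ts_at_switch = cr0_ts;
	switched = true;
}

/* A point where an interrupt could preempt the running thread. */
static void
model_preempt_point(void)
{
	if (cur->preempt_disabled > 0) {
		preempt_deferred = true;   /* honored when re-enabled */
		return;
	}
	model_switch();
}

/* Stand-in for the FPU save op: saves state and sets CR0.TS. */
static void
model_fp_save(void *arg)
{
	(void) arg;
	cr0_ts = true;
}

/* Stand-in for the FPU free op: preempted before it disables the FPU. */
static void
model_fp_free(void *arg)
{
	(void) arg;
	model_preempt_point();         /* the window from the stack trace */
	cr0_ts = true;                 /* models fpdisable() */
}

/* Unlink each ctxop, then call its free op; optionally block preemption. */
static void
model_freectx(struct thread_model *t, bool with_fix)
{
	struct ctxop *c;

	if (with_fix)
		t->preempt_disabled++;
	while ((c = t->ctx) != NULL) {
		t->ctx = c->next;          /* ctxop is now invisible to savectx */
		c->free_op(c->arg);
	}
	if (with_fix) {
		t->preempt_disabled--;
		if (t->preempt_disabled == 0 && preempt_deferred) {
			preempt_deferred = false;
			model_switch();
		}
	}
}

/* Returns the CR0.TS value the next thread would inherit. */
bool
run_scenario(bool with_fix)
{
	struct ctxop fpu = { model_fp_save, model_fp_free, NULL, NULL };
	struct thread_model t = { &fpu, 0 };

	cur = &t;
	cr0_ts = false;                /* FPU in use by the exiting context */
	preempt_deferred = false;
	switched = false;
	ts_at_switch = false;

	model_freectx(&t, with_fix);
	assert(switched);              /* the model always forces one switch */
	return (ts_at_switch);
}
```

In this model, run_scenario(false) returns false: the switch happens with the ctxop already unlinked, so nothing sets CR0.TS before the next thread runs. run_scenario(true) returns true, because the preemption is deferred until after the free op has disabled the FPU.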
Updated by Robert Mustacchi about 10 years ago
- Status changed from Pending RTI to Resolved
Resolved in 7ac89421b5a801d998dd60ee8cd0684bc6e83fcd.