Bug #15881
openasy close logic is dubious at best
0%
Description
While debugging an unrelated bug in a clone of asy (itself cloned, like all UART drivers, from zs, which may well have been cloned from the very first UART driver that lived in the garden), I noticed that asyclose has some dubious logic and even more dubious comments associated with clearing ASYNC_ISOPEN
and synchronising that with qprocsoff
/qprocson
and interrupts.
The first and perhaps most serious problem is that ASYNC_ISOPEN
is never cleared until just before we drop the lock and return, and indeed CAN never be cleared until then in this or any other function, but our logic for disabling the receiver interrupt assumes that it can be:
» /* » »* If nobody's using it now, turn off receiver interrupts. » »*/ » if ((async->async_flags & (ASYNC_WOPEN|ASYNC_ISOPEN)) == 0) { » » icr = ddi_get8(asy->asy_iohandle, asy->asy_ioaddr + ICR); » » ddi_put8(asy->asy_iohandle, asy->asy_ioaddr + ICR, » » » (icr & ~RIEN)); » } » mutex_exit(&asy->asy_excl_hi); out: » ttycommon_close(&async->async_ttycommon); » /* » »* Cancel outstanding "bufcall" request. » »*/ » if (async->async_wbufcid != 0) { » » unbufcall(async->async_wbufcid); » » async->async_wbufcid = 0; » } » /* Note that qprocsoff can't be done until after interrupts are off */ » qprocsoff(q); » q->q_ptr = WR(q)->q_ptr = NULL; » async->async_ttycommon.t_readq = NULL; » async->async_ttycommon.t_writeq = NULL; » /* » »* Clear out device state, except persistant device property flags. » »*/ » async->async_flags &= (ASYNC_DTR_DELAY|ASY_RTS_DTR_OFF); » cv_broadcast(&async->async_flags_cv); » mutex_exit(&asy->asy_excl); » DEBUGCONT1(ASY_DEBUG_CLOSE, "asy%dclose: done\n", instance); » return (0);
One interesting side effect of this is that in the panic path, high-level interrupts will continue to fire even if the device isn't open, each one removing one and only one byte (even if FIFOs are enabled!) from the UART receiver. That's mostly harmless when we're running normally, but if something in the kernel's panic path wants to take over the UART and use it in polled mode, the interrupt handler will fight with the polled receiver for incoming data. That's how we found this; there is to my knowledge no analogous code in the gate and we've addressed that problem in another way. The only observable harm done on i86pc would seem to be CPU time wasted running interrupts that the driver thought it was going to disable because no one is using the device. At the very best, we have incorrect comments and dead code, but there seems to be a clear intent to disable this interrupt when the device isn't in use. One possible caveat: the interrupt might be necessary on a console device to receive the break sequence and enter the debugger (though the console device is probably kept open forever).
In addition to this, we have an unused label (out
), and the comments describing the relationship between clearing ASYNC_ISOPEN
and calling qprocsoff
do not match either reality or the comments in asyopen
. We know that if we take a soft interrupt with ASYNC_ISOPEN
(and/or ASYNC_WOPEN
) set, we will call canput
, which requires that qprocson
have been called and qprocsoff
not. The comments in the open path therefore seem reasonable. But the comment here in the close path is not: because we could not possibly have cleared ASYNC_ISOPEN
at this point, the clearing of RIEN
in the ICR is dead code. That means we're certainly calling qprocsoff
with interrupts still firing away. Why is that ok? Well, it's possible that it isn't and there's still a race here, but I have to believe we'd have hit it by now. It appears instead that because we're still holding asy_excl
here for this entire set of operations, the softint handler can never see the queues torn down while ASYNC_ISOPEN
is set. Good thing, too, first because even if we stop the receiver interrupt, the high-level interrupt handler might still be running and could trigger a softint which is where canput
gets called. So the comments are certainly wrong, and should instead describe which mutex protects the queue state and async_flags
. But none of this changes the fact that the receiver interrupt is never turned off until the driver detaches or someone ioctls us to take us through an asy_program
path with CREAD
clear.
Note also that in the open path, we drop all the locks, then call qprocson
and finally set ASYNC_ISOPEN
. The comment here is correct but the logic is not:
» async->async_ttycommon.t_readq = rq; » async->async_ttycommon.t_writeq = WR(rq); » rq->q_ptr = WR(rq)->q_ptr = (caddr_t)async; » mutex_exit(&asy->asy_excl); » /* » »* Caution here -- qprocson sets the pointers that are used by canput » »* called by async_softint. ASYNC_ISOPEN must *not* be set until those » »* pointers are valid. » »*/ » qprocson(rq); » async->async_flags |= ASYNC_ISOPEN; » async->async_polltid = 0; » DEBUGCONT1(ASY_DEBUG_INIT, "asy%dopen: done\n", unit); » return (0);
We have already run through asy_program
here and if CREAD
is set, RIEN
will also be set in the ICR, allowing interrupts to fire. As soon as we drop the last lock here, an interrupt could indeed fire and would see ASYNC_ISOPEN
clear and the queues not set up; the call to canput
would panic. That seems ok, because ASYNC_ISOPEN
is clear... but it's not. The comment is telling us that we have a memory ordering problem: it is possible for the store to async_flags
to become visible to the CPU on which the interrupt is firing before the stores to the pointers done by qprocson
-- we're not holding a lock here, the addresses targeted by these stores aren't generally in the same cache line, and there is no memory barrier. If it's really necessary to drop the mutex before doing these operations, a barrier is required to force visibility of the other changes before the store to async_flags
(and possibly also async_polltid
; I didn't look closely at it).
This is all, as the synopsis would have it, dubious at best. At an absolute minimum, existing comments and dead code should be removed and replaced with something accurate. If the intent really is to disable the receiver interrupt, that should be done and tested carefully. Either the existing code violates the synchronisation rules for async_flags
that should be documented or it is a special case that requires a proper comment and a memory barrier. The unused label should be removed.
No data to display