mlxcx panicked in mac_link_update()
A SmartOS user hit a panic and shared the following screenshot: https://pasteboard.co/JVjZIOP.jpg. It showed the system panicking while dereferencing 0x9c in mac_link_update(), called from mlxcx. As Dan McDonald noted, the faulting address matched the offset used by the instruction at mac_link_update+1: movl %esi,0x9c(%rdi), suggesting that we had a NULL mac_handle_t. Looking at the mlxcx driver source, the key observation is that the mac handle is the last thing created and registered in attach and the first thing to go in detach. With that, there are two ways this could have happened (it is hard to say which, due to the lack of a dump):
1. Because we create the taskq and enable interrupts first, it is possible that we get an asynchronous interrupt before we call mac_register() and thus get to mlxcx_update_link_state() with a NULL handle.
2. Because the taskq processes the link state event asynchronously, an event could have been dispatched and then we called detach(). In this case, there would be a window between detach unregistering with mac and the event being processed before the call to
To fix both of these, we need two things:
1. We should make mlxcx_update_link_state() gate itself on having a valid mac handle.
2. The teardown path should issue a taskq_wait() to make sure that all outstanding elements of the async taskq have finished processing.
Just doing (1) is insufficient because we could easily have a problem where the mlxcx_t itself was bad (though we'd panic a bit sooner).
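To make the two fixes concrete, here is a minimal userland sketch of the pattern. It is not the mlxcx code: every model_* name is hypothetical, the "taskq" is just a counter of pending events, and model_taskq_wait() only stands in for the guarantee that taskq_wait() provides (no queued work remains once it returns). It assumes a single thread, so it shows the ordering of the fix rather than the real concurrency.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are not the real mac or mlxcx types. */
typedef struct model_mac {
	int mm_link_state;
} model_mac_t;

typedef struct model_drv {
	model_mac_t *md_mac_hdl;  /* NULL before register and after unregister */
	int md_pending;           /* queued link state events (models the taskq) */
	int md_pending_state;     /* state carried by the latest queued event */
} model_drv_t;

/* Fix (1): gate on a valid handle. Returns 1 if delivered, 0 if skipped. */
int
model_update_link_state(model_drv_t *drv, int state)
{
	if (drv->md_mac_hdl == NULL)
		return (0);
	drv->md_mac_hdl->mm_link_state = state;
	return (1);
}

/* Interrupt path: queue a link state event for later processing. */
void
model_dispatch(model_drv_t *drv, int state)
{
	drv->md_pending++;
	drv->md_pending_state = state;
}

/*
 * Models the taskq_wait() guarantee by running everything that is queued.
 * Returns how many events actually reached the mac handle.
 */
int
model_taskq_wait(model_drv_t *drv)
{
	int delivered = 0;

	while (drv->md_pending > 0) {
		drv->md_pending--;
		delivered += model_update_link_state(drv,
		    drv->md_pending_state);
	}
	return (delivered);
}

/* Fix (2): teardown drains the queue before anything would be freed. */
void
model_detach(model_drv_t *drv)
{
	drv->md_mac_hdl = NULL;        /* models unregistering from mac first */
	(void) model_taskq_wait(drv);  /* late events are skipped by the gate */
	assert(drv->md_pending == 0);  /* only now would freeing drv be safe */
}
```

In this model, an event dispatched before the handle is set, or after it is cleared in model_detach(), is dropped by the gate instead of dereferencing NULL, and the drain ensures nothing is still queued against the driver state when teardown continues.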