Bug #14159
closedahci can deadlock with sata framework
100%
Description
We had a system that would have the zfs deadman seemingly randomly fire on the systems boot zpool. Attempting to do a crash dump failed since I/O to the mirrored boot drives seemed to be hung. We were able to get kmdb loaded and capture the state of the system after this occurred. Unfortunately, the serial console was not configured, so there are only screenshots of the kmdb output (though I'm attempting to reproduce w/ the serial console configured so I can copy/paste the kmdb analysis and will update if possible).
Looking at threads in ahci (::stacks -m ahci), I saw three threads. One thread was servicing an ahci interrupt and had a call stack containing (I'm abbreviating this w/ the key bits since I'm having to copy it by hand):
mutex_vector_enter sata`sata_scsi_init_pkt scsi`scsi_init_pkt ... scsi`scsi_hba_pkt_comp sata`sata_txlt_rw_completion ahci`flush_doneq ahci`ahci_intr_ncq_events ahci`ahci_intr_set_device_bits ahci`ahci_port_intr ahci`ahci_intr
The mutex in question is from this (in sata_scsi_init_pkt
):
mutex_enter(&(SATA_CPORT_MUTEX(sata_hba_inst,
sata_device.satadev_addr.cport)));
Examining who held the mutex, it was a thread with the following call stack (again abbreviated):
... cv_wait ahci`ahci_do_sync_start ahci`ahci_tran_start sata`sata_fetch_device_identify_data sata`sata_txlt_inquiry sata`sata_scsi_start scsi`scsi_transport sd`sd_start_cmds sd`sd_restore_throttle ... taskq_thread
The above mentioned mutex is acquired in sata_txlt_inquiry
and is held for effectively the entire function. Note that the mutex associated with the blocked CV is not the cport_mutex. We are blocked due to the following code:
while (spkt->satapkt_reason == SATA_PKT_BUSY)
cv_wait(&ahci_portp->ahciport_cv,
&ahci_portp->ahciport_mutex);
So the command is waiting for completion with the same sata framework cport_mutex
being held that the ISR is blocked on. Eventually, the ahci watchdog thread attempts to time out the request, it resets the port
mutex_vector_enter sata`sata_hba_event_notify ahci`ahci_restart_port_wait_till_ready ahci`ahci_timeout_pkts ahci`ahci_watchdog_handler
In this case the watchdog handler has reset the port, and is blocked on the sata framework on the same cport_mutex
trying to notify the sata framework that the port has been reset, so it also cannot make progress and clean things up (i.e. complete the synchronous request as failed/timed out).
I believe the error is sata_txlt_inquiry
holding cport_mutex
while calling into the HBA (the sata_tran_start
method i.e. ahci_tran_start
). No other callers in the SATA framework do this -- either they just don't have the mutex held, or they explicitly drop it around the call into the HBA). We (unfortunately) don't have any sort of comments or docs to point at to definitely say it's wrong, but the evidence is highly suggestive that is the correct way to resolve the deadlock.
Files
Updated by Marcel Telka 8 months ago
I think the problem might be missing mutex_exit/mutex_enter
calls around this code in ahci_restart_port_wait_till_ready()
:
8763 sata_hba_event_notify(ahci_ctlp->ahcictl_dip, &sdevice, 8764 SATA_EVNT_DEVICE_RESET);
I tried to add those calls more than 8 years ago when I worked on #3839, but such a change was refused: https://illumos.topicbox.com/groups/developer/T0bbacadf3d9c767b
Updated by Jason King 8 months ago
- Assignee set to Jason King
I'm in the process of testing/verifying a fix, but I think the immediate issue is that if something issues an INQUIRY for the ATA INFORMATION page to the device, the sata framework's cport_mutex
is held across the call into the HBA. All the other callers of sata_fetch_device_identify_data
(or more specifically anyone invoking the HBA's start (sata_hba_start
) method invoke the start function without that lock being held -- either they explicitly drop the lock themselves, or their callers do so.
Updated by Jason King 8 months ago
That said, the existing sata event handling code leaves much to be desired and could use improvement.
Updated by Jason King 7 months ago
The plan is to fix the deadlock as a part of this change. Any other bits will have their own ticket.
Updated by Garrett D'Amore 7 months ago
So Marcel, I'm the one who blocked the change, and let me be the first to say I was wrong to do so, although I understand why I did.
I've since reviewed the code in rather gory detail, and I'm not super happy about this SATA framework code at all, but this one change (fixing the locks) is actually needed.
I have also written a change (which we may or may not upstream) which moves the notify stuff out of the synchronous path and into a taskq. I think that is more or less what I was suggesting 8 years ago, but I seem to have arrived at the same solution "independently" (of myself?) 8 years later on.
Anyway, the change is in our tree, and so far testing seems to hold up on it.
Updated by Jason King 7 months ago
- File sata.d sata.d added
- File write-file.c write-file.c added
- File inquiry-ata.c inquiry-ata.c added
Testing this has been a bit challenging due to just how tight the window is. However using the attached files, I've had some amount of success.
Using sata.d
, I attempted to increase the window where a deadlock occurs.
At the same time, I used write-file.c
to just write a 1MB block of random data to a file, flush, and start all over (overwriting the 1MB -- the intention is to generate write I/O and not have to worry about exhausting space). While all of that is running, ATA information requests are sent to the device.
With all of that running, I was able to produce the lockup a couple of times over the course of the day on an unpatched system. While it would be nice to be able to replicate it somewhat quicker and easier, it appears that NCQ I/Os (basically reads and writes) have the effect of starving out non-NCQ I/Os (such as the ATA INFORMATION request). As a result, it meant requiring having the AHCI ISR service an NCQ I/O completion (with no other NCQ I/Os outstanding), then in the small window where the ISR marks the slot available and the I/O callbacks are invoked, have the ATA Information go through without getting 'preempted' (for lack of a better term) by another NCQ I/O. Even with various attempts using chill()
with dtrace, it proved to be challenging.
In comparison though, with this fix in place, the same setup that produced several hangs within a day ran for 3+ days non-stop without causing a system hang -- system responsiveness wasn't great, but considering it was slowing down I/O servicing on the root pool of the system, while inducing a fair amount of constant write I/O, that's not too surprising.
This is in addition to examining the other callers to sata_fetch_device_identify_data
which all call it without cport_mutex
being held -- which itself is a strong indicator that the ATA Information request should also follow suit -- obviously the stacks seen in the deadlock provide even stronger proof.
Updated by Electric Monk 6 months ago
- Status changed from New to Closed
- % Done changed from 0 to 100
git commit df07a22216251845ccadf35a5acc27bfa0b1a4d1
commit df07a22216251845ccadf35a5acc27bfa0b1a4d1 Author: Jason King <jason.brian.king@gmail.com> Date: 2021-12-01T01:54:11.000Z 14159 ahci can deadlock with sata framework Reviewed by: Garrett D'Amore <garrett@damore.org> Approved by: Dan McDonald <danmcd@joyent.com>