ahci can deadlock with sata framework
We had a system that would have the zfs deadman seemingly randomly fire on the systems boot zpool. Attempting to do a crash dump failed since I/O to the mirrored boot drives seemed to be hung. We were able to get kmdb loaded and capture the state of the system after this occurred. Unfortunately, the serial console was not configured, so there are only screenshots of the kmdb output (though I'm attempting to reproduce w/ the serial console configured so I can copy/paste the kmdb analysis and will update if possible).
Looking at threads in ahci (::stacks -m ahci), I saw three threads. One thread was servicing an ahci interrupt and had a call stack containing (I'm abbreviating this w/ the key bits since I'm having to copy it by hand):
mutex_vector_enter sata`sata_scsi_init_pkt scsi`scsi_init_pkt ... scsi`scsi_hba_pkt_comp sata`sata_txlt_rw_completion ahci`flush_doneq ahci`ahci_intr_ncq_events ahci`ahci_intr_set_device_bits ahci`ahci_port_intr ahci`ahci_intr
The mutex in question is from this (in
Examining who held the mutex, it was a thread with the following call stack (again abbreviated):
... cv_wait ahci`ahci_do_sync_start ahci`ahci_tran_start sata`sata_fetch_device_identify_data sata`sata_txlt_inquiry sata`sata_scsi_start scsi`scsi_transport sd`sd_start_cmds sd`sd_restore_throttle ... taskq_thread
The above mentioned mutex is acquired in
sata_txlt_inquiry and is held for effectively the entire function. Note that the mutex associated with the blocked CV is not the cport_mutex. We are blocked due to the following code:
while (spkt->satapkt_reason == SATA_PKT_BUSY) cv_wait(&ahci_portp->ahciport_cv, &ahci_portp->ahciport_mutex);
So the command is waiting for completion with the same sata framework
cport_mutex being held that the ISR is blocked on. Eventually, the ahci watchdog thread attempts to time out the request, it resets the port
mutex_vector_enter sata`sata_hba_event_notify ahci`ahci_restart_port_wait_till_ready ahci`ahci_timeout_pkts ahci`ahci_watchdog_handler
In this case the watchdog handler has reset the port, and is blocked on the sata framework on the same
cport_mutex trying to notify the sata framework that the port has been reset, so it also cannot make progress and clean things up (i.e. complete the synchronous request as failed/timed out).
I believe the error is
cport_mutex while calling into the HBA (the
sata_tran_start method i.e.
ahci_tran_start). No other callers in the SATA framework do this -- either they just don't have the mutex held, or they explicitly drop it around the call into the HBA). We (unfortunately) don't have any sort of comments or docs to point at to definitely say it's wrong, but the evidence is highly suggestive that is the correct way to resolve the deadlock.
Updated by Marcel Telka 11 days ago
I think the problem might be missing
mutex_exit/mutex_enter calls around this code in
8763 sata_hba_event_notify(ahci_ctlp->ahcictl_dip, &sdevice, 8764 SATA_EVNT_DEVICE_RESET);
I tried to add those calls more than 8 years ago when I worked on #3839, but such a change was refused: https://illumos.topicbox.com/groups/developer/T0bbacadf3d9c767b
Updated by Jason King 11 days ago
- Assignee set to Jason King
I'm in the process of testing/verifying a fix, but I think the immediate issue is that if something issues an INQUIRY for the ATA INFORMATION page to the device, the sata framework's
cport_mutex is held across the call into the HBA. All the other callers of
sata_fetch_device_identify_data (or more specifically anyone invoking the HBA's start (
sata_hba_start) method invoke the start function without that lock being held -- either they explicitly drop the lock themselves, or their callers do so.
Updated by Garrett D'Amore 10 days ago
So Marcel, I'm the one who blocked the change, and let me be the first to say I was wrong to do so, although I understand why I did.
I've since reviewed the code in rather gory detail, and I'm not super happy about this SATA framework code at all, but this one change (fixing the locks) is actually needed.
I have also written a change (which we may or may not upstream) which moves the notify stuff out of the synchronous path and into a taskq. I think that is more or less what I was suggesting 8 years ago, but I seem to have arrived at the same solution "independently" (of myself?) 8 years later on.
Anyway, the change is in our tree, and so far testing seems to hold up on it.
Updated by Jason King about 4 hours ago
- File sata.d sata.d added
- File write-file.c write-file.c added
- File inquiry-ata.c inquiry-ata.c added
Testing this has been a bit challenging due to just how tight the window is. However using the attached files, I've had some amount of success.
sata.d, I attempted to increase the window where a deadlock occurs.
At the same time, I used
write-file.c to just write a 1MB block of random data to a file, flush, and start all over (overwriting the 1MB -- the intention is to generate write I/O and not have to worry about exhausting space). While all of that is running, ATA information requests are sent to the device.
With all of that running, I was able to produce the lockup a couple of times over the course of the day on an unpatched system. While it would be nice to be able to replicate it somewhat quicker and easier, it appears that NCQ I/Os (basically reads and writes) have the effect of starving out non-NCQ I/Os (such as the ATA INFORMATION request). As a result, it meant requiring having the AHCI ISR service an NCQ I/O completion (with no other NCQ I/Os outstanding), then in the small window where the ISR marks the slot available and the I/O callbacks are invoked, have the ATA Information go through without getting 'preempted' (for lack of a better term) by another NCQ I/O. Even with various attempts using
chill() with dtrace, it proved to be challenging.
In comparison though, with this fix in place, the same setup that produced several hangs within a day ran for 3+ days non-stop without causing a system hang -- system responsiveness wasn't great, but considering it was slowing down I/O servicing on the root pool of the system, while inducing a fair amount of constant write I/O, that's not too surprising.
This is in addition to examining the other callers to
sata_fetch_device_identify_data which all call it without
cport_mutex being held -- which itself is a strong indicator that the ATA Information request should also follow suit -- obviously the stacks seen in the deadlock provide even stronger proof.