Project

General

Profile

Actions

Bug #14159

open

ahci can deadlock with sata framework

Added by Jason King 12 days ago. Updated about 4 hours ago.

Status:
New
Priority:
Normal
Assignee:
Category:
driver - device drivers
Start date:
Due date:
% Done:

0%

Estimated time:
Difficulty:
Medium
Tags:
Gerrit CR:

Description

We had a system that would have the zfs deadman seemingly randomly fire on the systems boot zpool. Attempting to do a crash dump failed since I/O to the mirrored boot drives seemed to be hung. We were able to get kmdb loaded and capture the state of the system after this occurred. Unfortunately, the serial console was not configured, so there are only screenshots of the kmdb output (though I'm attempting to reproduce w/ the serial console configured so I can copy/paste the kmdb analysis and will update if possible).

Looking at threads in ahci (::stacks -m ahci), I saw three threads. One thread was servicing an ahci interrupt and had a call stack containing (I'm abbreviating this w/ the key bits since I'm having to copy it by hand):

mutex_vector_enter
sata`sata_scsi_init_pkt
scsi`scsi_init_pkt
...
scsi`scsi_hba_pkt_comp
sata`sata_txlt_rw_completion
ahci`flush_doneq
ahci`ahci_intr_ncq_events
ahci`ahci_intr_set_device_bits
ahci`ahci_port_intr
ahci`ahci_intr

The mutex in question is from this (in sata_scsi_init_pkt):

mutex_enter(&(SATA_CPORT_MUTEX(sata_hba_inst,
          sata_device.satadev_addr.cport)));

Examining who held the mutex, it was a thread with the following call stack (again abbreviated):

...
cv_wait
ahci`ahci_do_sync_start
ahci`ahci_tran_start
sata`sata_fetch_device_identify_data
sata`sata_txlt_inquiry
sata`sata_scsi_start
scsi`scsi_transport
sd`sd_start_cmds
sd`sd_restore_throttle
...
taskq_thread

The above mentioned mutex is acquired in sata_txlt_inquiry and is held for effectively the entire function. Note that the mutex associated with the blocked CV is not the cport_mutex. We are blocked due to the following code:

    while (spkt->satapkt_reason == SATA_PKT_BUSY)
              cv_wait(&ahci_portp->ahciport_cv,
                  &ahci_portp->ahciport_mutex);

So the command is waiting for completion with the same sata framework cport_mutex being held that the ISR is blocked on. Eventually, the ahci watchdog thread attempts to time out the request, it resets the port

mutex_vector_enter
sata`sata_hba_event_notify
ahci`ahci_restart_port_wait_till_ready
ahci`ahci_timeout_pkts
ahci`ahci_watchdog_handler

In this case the watchdog handler has reset the port, and is blocked on the sata framework on the same cport_mutex trying to notify the sata framework that the port has been reset, so it also cannot make progress and clean things up (i.e. complete the synchronous request as failed/timed out).

I believe the error is sata_txlt_inquiry holding cport_mutex while calling into the HBA (the sata_tran_start method i.e. ahci_tran_start). No other callers in the SATA framework do this -- either they just don't have the mutex held, or they explicitly drop it around the call into the HBA). We (unfortunately) don't have any sort of comments or docs to point at to definitely say it's wrong, but the evidence is highly suggestive that is the correct way to resolve the deadlock.


Files

sata.d (541 Bytes) sata.d D script to widen the deadlock window Jason King, 2021-10-25 11:27 PM
write-file.c (925 Bytes) write-file.c Simple program to continuously write 1MB of data, flush in a loop Jason King, 2021-10-25 11:27 PM
inquiry-ata.c (1.35 KB) inquiry-ata.c Program to issue ATA INFORMATION requests to device Jason King, 2021-10-25 11:27 PM
Actions #1

Updated by Marcel Telka 11 days ago

I think the problem might be missing mutex_exit/mutex_enter calls around this code in ahci_restart_port_wait_till_ready():

8763      sata_hba_event_notify(ahci_ctlp->ahcictl_dip, &sdevice,
8764          SATA_EVNT_DEVICE_RESET);

I tried to add those calls more than 8 years ago when I worked on #3839, but such a change was refused: https://illumos.topicbox.com/groups/developer/T0bbacadf3d9c767b

Actions #2

Updated by Jason King 11 days ago

  • Assignee set to Jason King

I'm in the process of testing/verifying a fix, but I think the immediate issue is that if something issues an INQUIRY for the ATA INFORMATION page to the device, the sata framework's cport_mutex is held across the call into the HBA. All the other callers of sata_fetch_device_identify_data (or more specifically anyone invoking the HBA's start (sata_hba_start) method invoke the start function without that lock being held -- either they explicitly drop the lock themselves, or their callers do so.

Actions #3

Updated by Jason King 11 days ago

That said, the existing sata event handling code leaves much to be desired and could use improvement.

Actions #4

Updated by Rich Lowe 11 days ago

I strongly recommend fixing this first, and then improving the code.

Actions #5

Updated by Jason King 10 days ago

  • Description updated (diff)
Actions #6

Updated by Jason King 10 days ago

The plan is to fix the deadlock as a part of this change. Any other bits will have their own ticket.

Actions #7

Updated by Garrett D'Amore 10 days ago

So Marcel, I'm the one who blocked the change, and let me be the first to say I was wrong to do so, although I understand why I did.

I've since reviewed the code in rather gory detail, and I'm not super happy about this SATA framework code at all, but this one change (fixing the locks) is actually needed.

I have also written a change (which we may or may not upstream) which moves the notify stuff out of the synchronous path and into a taskq. I think that is more or less what I was suggesting 8 years ago, but I seem to have arrived at the same solution "independently" (of myself?) 8 years later on.

Anyway, the change is in our tree, and so far testing seems to hold up on it.

Actions #8

Updated by Electric Monk 3 days ago

  • Gerrit CR set to 1772
Actions #9

Updated by Jason King about 4 hours ago

Testing this has been a bit challenging due to just how tight the window is. However using the attached files, I've had some amount of success.

Using sata.d, I attempted to increase the window where a deadlock occurs.
At the same time, I used write-file.c to just write a 1MB block of random data to a file, flush, and start all over (overwriting the 1MB -- the intention is to generate write I/O and not have to worry about exhausting space). While all of that is running, ATA information requests are sent to the device.

With all of that running, I was able to produce the lockup a couple of times over the course of the day on an unpatched system. While it would be nice to be able to replicate it somewhat quicker and easier, it appears that NCQ I/Os (basically reads and writes) have the effect of starving out non-NCQ I/Os (such as the ATA INFORMATION request). As a result, it meant requiring having the AHCI ISR service an NCQ I/O completion (with no other NCQ I/Os outstanding), then in the small window where the ISR marks the slot available and the I/O callbacks are invoked, have the ATA Information go through without getting 'preempted' (for lack of a better term) by another NCQ I/O. Even with various attempts using chill() with dtrace, it proved to be challenging.

In comparison though, with this fix in place, the same setup that produced several hangs within a day ran for 3+ days non-stop without causing a system hang -- system responsiveness wasn't great, but considering it was slowing down I/O servicing on the root pool of the system, while inducing a fair amount of constant write I/O, that's not too surprising.

This is in addition to examining the other callers to sata_fetch_device_identify_data which all call it without cport_mutex being held -- which itself is a strong indicator that the ATA Information request should also follow suit -- obviously the stacks seen in the deadlock provide even stronger proof.

Actions

Also available in: Atom PDF