Bug #7190

e1000g receive DMA addresses out of sync

Added by Arne Jansen about 4 years ago. Updated about 4 years ago.

After a network failover, the network interface stopped working: packets were received with delays of several seconds to minutes and with the wrong length. The adapter is an Intel 82574L (e1000g).
During normal operation, the hardware reads a receive descriptor, copies the received data to the specified host address via DMA, updates the length and status in the descriptor, and writes it back to host memory. Afterwards it updates its HEAD register. The DMA address (buffer_addr) in the descriptor should not change during this sequence. In our case, though, the DMA address changes during descriptor writeback. It looks like the hardware reads the descriptor from the wrong address but writes it back to the correct location at HEAD. On at least several occasions, the address that was written back had been read from HEAD+0x80.
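To make the suspected failure mode concrete, here is a minimal C model of the writeback sequence described above. It is a sketch, not driver code: rx_desc_t, ring_init, hw_writeback and desc_in_sync are hypothetical names, the 16-byte layout follows the legacy e1000 receive descriptor, and treating HEAD+0x80 as a byte offset (8 descriptors ahead) is an assumption about how to read the observation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	RING_SIZE	16u
#define	RXD_STAT_DD	0x01u	/* "descriptor done", set on writeback */

/* Legacy e1000 receive descriptor layout (16 bytes). */
typedef struct rx_desc {
	uint64_t buffer_addr;	/* DMA address of the receive buffer */
	uint16_t length;	/* filled in by hardware on writeback */
	uint16_t csum;
	uint8_t  status;
	uint8_t  errors;
	uint16_t special;
} rx_desc_t;

/* Post one (made-up) buffer address per descriptor and remember it. */
static void
ring_init(rx_desc_t *ring, uint64_t *posted)
{
	assert(ring != NULL && posted != NULL);
	for (unsigned i = 0; i < RING_SIZE; i++) {
		posted[i] = 0x10000u + (uint64_t)i * 0x800u;
		ring[i] = (rx_desc_t){ .buffer_addr = posted[i] };
	}
}

/*
 * Hypothetical hardware model: read the descriptor at index
 * (head + skew), fill in length and status, and write the result back
 * at index head. skew == 0 is the normal case; a non-zero skew models
 * the descriptor being fetched from the wrong address.
 */
static void
hw_writeback(rx_desc_t *ring, unsigned head, unsigned skew, uint16_t len)
{
	rx_desc_t d;

	assert(ring != NULL);
	d = ring[(head + skew) % RING_SIZE];
	d.length = len;
	d.status |= RXD_STAT_DD;
	ring[head % RING_SIZE] = d;
}

/* Driver-side check: does descriptor i still point at the posted buffer? */
static int
desc_in_sync(const rx_desc_t *ring, const uint64_t *posted, unsigned i)
{
	return (ring[i].buffer_addr == posted[i]);
}
```

With skew 0 the written-back descriptor keeps its buffer_addr. With a skew of 0x80 / sizeof (rx_desc_t), i.e. 8 descriptors, the descriptor at HEAD ends up carrying the buffer address that was posted 8 slots further on, so the driver and the hardware disagree about where the packet data went.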
The datasheet states that HEAD may only be written directly after a hardware reset and before the receiver is enabled; otherwise the hardware becomes "unstable". I did some experiments and found that if one writes to HEAD during normal operation, the hardware does indeed start reading descriptors from the wrong address, and its behavior becomes very erratic.
So my working theory is that the driver somehow wrote to HEAD during normal operation, triggered by the link event. The only function that writes to HEAD is e1000g_rx_setup, called from e1000g_start. If we can find a code path that calls e1000g_start without a preceding e1000g_stop, that could explain the bug. Unfortunately, I haven't been able to find such a path yet. The bug is also not easily reproducible: it only occurred on about 1% of our machines.
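One way to hunt for such a path would be to instrument the HEAD write itself, so that an unpaired start is recorded when it happens rather than inferred afterwards. The following is only a sketch of that idea on a stand-in model: adapter_model_t, model_rx_setup, model_start and model_stop are hypothetical and do not reflect the real e1000g data structures or locking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for per-adapter state; not the real driver struct. */
typedef struct adapter_model {
	bool	 rx_running;		/* true between start and stop */
	unsigned head;			/* models the HEAD register */
	unsigned unpaired_head_writes;	/* HEAD written while running */
} adapter_model_t;

/* Models e1000g_rx_setup, the only place where HEAD is written. */
static void
model_rx_setup(adapter_model_t *a)
{
	assert(a != NULL);
	if (a->rx_running) {
		/*
		 * Start without a preceding stop. A real driver would
		 * log here (ideally with a stack trace) instead of
		 * just counting.
		 */
		a->unpaired_head_writes++;
	}
	a->head = 0;
}

/* Models e1000g_start: set up the receive ring, then mark it running. */
static void
model_start(adapter_model_t *a)
{
	model_rx_setup(a);
	a->rx_running = true;
}

/* Models e1000g_stop: the receiver is no longer running. */
static void
model_stop(adapter_model_t *a)
{
	assert(a != NULL);
	a->rx_running = false;
}
```

A properly paired start, stop, start sequence leaves the counter at zero; a second start without an intervening stop increments it, which is exactly the event the theory predicts around the link change.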



Updated by Arne Jansen about 4 years ago

As a workaround, we're currently testing the following patch:

diff --git a/usr/src/uts/common/io/e1000g/e1000g_rx.c b/usr/src/uts/common/io/e1
index 57c1d0d..303c3bd 100644
--- a/usr/src/uts/common/io/e1000g/e1000g_rx.c
+++ b/usr/src/uts/common/io/e1000g/e1000g_rx.c
@@ -547,6 +547,16 @@ e1000g_receive(e1000g_rx_ring_t *rx_ring, mblk_t **tail, ui
                        goto rx_drop;

+               /*
+                * workaround for #7190. After a switch reset packet queue
+                * and descriptor dma addresses got out of sync. Detect
+                * this and flag the error. Let the watchdog timer do the reset
+                */
+               if (current_desc->buffer_addr != rx_buf->dma_address) {
+                       e1000g_log(Adapter, CE_WARN, "receive dma descriptors " 
+                           "got out of sync, resetting adapter");
+                       Adapter->e1000g_state |= E1000G_ERROR;
+               }
                accept_frame = (current_desc->errors == 0) ||
                    ((current_desc->errors &
                    (E1000_RXD_ERR_TCPE | E1000_RXD_ERR_IPE)) != 0);
