Bug #12383

Slow down and lock up in mlxcx receive interrupt path

Added by Paul Winder 11 months ago. Updated 9 months ago.

Status: Closed
Priority: Normal
Category: driver - device drivers
% Done: 100%
Difficulty: Medium

Description

In our testing we observed the receive path would at times slow down and occasionally just appear to lock up.

The root cause is in function mlxcx_buf_take_n() which is called in the receive data path to refill the receive work queue with DMA buffers. The code in question is:

        mutex_enter(&s->mlbs_mtx);
        while (done < nbufs) {
                while (list_is_empty(&s->mlbs_free)) {
                        (void) cv_reltimedwait(&s->mlbs_free_nonempty,
                            &s->mlbs_mtx, wtime, TR_MILLISEC);
                        if (list_is_empty(&s->mlbs_free) &&
                            empty++ >= MLXCX_BUF_TAKE_N_MAX_RETRIES) {
                                mutex_exit(&s->mlbs_mtx);
                                return (done);
                        }
                }
                b = list_remove_head(&s->mlbs_free);
                ASSERT3U(b->mlb_state, ==, MLXCX_BUFFER_FREE);
                b->mlb_state = MLXCX_BUFFER_ON_WQ;
                list_insert_tail(&s->mlbs_busy, b);
                bp[done++] = b;
        }
        mutex_exit(&s->mlbs_mtx);

When the system slows down or hangs, it is waiting in cv_reltimedwait(), which is being called from an interrupt thread. It is not a good idea to call any form of cv_wait() in interrupt context, since it blocks all interrupts at the same or lower priority. Before giving up, the loop has to endure three 5ms timeouts, each of which is rounded up to a clock tick, so in practice 3 * 10ms.

In cases we saw, the interrupt thread appeared to be permanently blocked here.

The loop is looking for buffers on the free list with which to restock the receive queue, and in the lock-up cases the free list is empty. Examining the buffers on the busy list, however, showed:

> ffffd0948fb413a0::walk list|::print -t mlxcx_buffer_t mlb_state
mlxcx_buffer_state_t mlb_state = 3 (MLXCX_BUFFER_ON_LOAN)
mlxcx_buffer_state_t mlb_state = 3 (MLXCX_BUFFER_ON_LOAN)
mlxcx_buffer_state_t mlb_state = 3 (MLXCX_BUFFER_ON_LOAN)
mlxcx_buffer_state_t mlb_state = 3 (MLXCX_BUFFER_ON_LOAN)
.
.
.
mlxcx_buffer_state_t mlb_state = 2 (MLXCX_BUFFER_ON_WQ)
mlxcx_buffer_state_t mlb_state = 2 (MLXCX_BUFFER_ON_WQ)
mlxcx_buffer_state_t mlb_state = 2 (MLXCX_BUFFER_ON_WQ)
mlxcx_buffer_state_t mlb_state = 2 (MLXCX_BUFFER_ON_WQ)
mlxcx_buffer_state_t mlb_state = 2 (MLXCX_BUFFER_ON_WQ)
mlxcx_buffer_state_t mlb_state = 2 (MLXCX_BUFFER_ON_WQ)
mlxcx_buffer_state_t mlb_state = 2 (MLXCX_BUFFER_ON_WQ)
mlxcx_buffer_state_t mlb_state = 2 (MLXCX_BUFFER_ON_WQ)
mlxcx_buffer_state_t mlb_state = 2 (MLXCX_BUFFER_ON_WQ)
mlxcx_buffer_state_t mlb_state = 2 (MLXCX_BUFFER_ON_WQ)

MLXCX_BUFFER_ON_LOAN means the buffer has been loaned to an mblk, and MLXCX_BUFFER_ON_WQ means the buffer is assigned to a slot in the receive queue, ready to receive data. So the test for an empty free list, and the decision to block on it, is misguided: the receive queue is still in a state to receive data. Meanwhile, the mblks whose buffers are MLXCX_BUFFER_ON_LOAN are all chained together waiting to be passed up the IP stack, but they are not being sent because we are stuck inside mlxcx_buf_take_n(), and hence nothing ever makes it back onto the free list.

To compound this, there is a further issue which causes mlxcx_buf_take_n() to be called on every interrupt. mlxcx_buf_take_n() is called by mlxcx_rq_refill(), and:

        if (!mlxcx_buf_loan(mlxp, buf)) {
                mlxcx_warn(mlxp, "!loan failed, dropping packet");
                mlxcx_buf_return(mlxp, buf);
                return (NULL);
        }
.
.
.
        /*
         * Don't check if a refill is needed on every single completion,
         * since checking involves taking the RQ lock.
         */
        if ((buf->mlb_wqe_index & 0x7) == 0) {
                mlxcx_work_queue_t *wq = mlcq->mlcq_wq;
                ASSERT(wq != NULL);
                mutex_enter(&wq->mlwq_mtx);
                if (!(wq->mlwq_state & MLXCX_WQ_TEARDOWN))
                        mlxcx_rq_refill(mlxp, wq);
                mutex_exit(&wq->mlwq_mtx);
        }

The check (buf->mlb_wqe_index & 0x7) == 0 is always true, because the preceding call to mlxcx_buf_loan(mlxp, buf) sets buf->mlb_wqe_index to zero. This exacerbates the issue: the refill path, and with it the potentially blocking mlxcx_buf_take_n(), runs on every completion rather than one in eight.

We have a fix in test which removes all the code relating to the cv_reltimedwait() and instead dispatches a taskq task to perform the refill, and only does so when the receive queue is running low on buffers.


Related issues

Related to illumos gate - Bug #12438: mlxcx should pass receive messages to mac layer more frequently (Closed, Paul Winder)
Related to illumos gate - Bug #12439: mlxcx send rings can overflow (Closed, Paul Winder)
Related to illumos gate - Bug #12440: mlxcx should not block in the send path (Closed, Paul Winder)
Related to illumos gate - Bug #12441: mlxcx default queue sizes are a bit on the small size (Closed, Paul Winder)
