Bug #12383

Slow down and lock up in mlxcx receive interrupt path

Added by Paul Winder 9 months ago. Updated 8 months ago.

Status: Closed
Priority: Normal
Assignee:
Category: driver - device drivers
Start date:
Due date:
% Done: 100%
Estimated time:
Difficulty: Medium
Tags:
Gerrit CR:

Description

In our testing we observed the receive path would at times slow down and occasionally just appear to lock up.

The root cause is in mlxcx_buf_take_n(), which is called in the receive data path to refill the receive work queue with DMA buffers. The code in question is:

        mutex_enter(&s->mlbs_mtx);
        while (done < nbufs) {
                while (list_is_empty(&s->mlbs_free)) {
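                        /*
                         * This is the problem: it can be reached from an
                         * interrupt thread, which then sleeps here waiting
                         * for buffers to be freed.
                         */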
                        (void) cv_reltimedwait(&s->mlbs_free_nonempty,
                            &s->mlbs_mtx, wtime, TR_MILLISEC);
                        if (list_is_empty(&s->mlbs_free) &&
                            empty++ >= MLXCX_BUF_TAKE_N_MAX_RETRIES) {
                                mutex_exit(&s->mlbs_mtx);
                                return (done);
                        }
                }
                b = list_remove_head(&s->mlbs_free);
                ASSERT3U(b->mlb_state, ==, MLXCX_BUFFER_FREE);
                b->mlb_state = MLXCX_BUFFER_ON_WQ;
                list_insert_tail(&s->mlbs_busy, b);
                bp[done++] = b;
        }
        mutex_exit(&s->mlbs_mtx);

When the system slows down or hangs, it is waiting in this cv_reltimedwait(), which is being called from an interrupt thread. It is not a good idea to call any form of cv_wait() in interrupt context, since it blocks all interrupts at the same or lower priority for the duration. And before the function gives up and returns, it has to endure 3 timeouts of 5ms each, with each wait rounded up to a whole clock tick (so 10ms), leaving the interrupt thread blocked for around 30ms per failed call.

In the cases we saw, the interrupt thread appeared to be permanently blocked here.

It is looking for buffers on the free list with which to restock the receive queue. In the lock-up cases the free list is empty, but examining the buffers on the busy list showed:

> ffffd0948fb413a0::walk list|::print -t mlxcx_buffer_t mlb_state
mlxcx_buffer_state_t mlb_state = 3 (MLXCX_BUFFER_ON_LOAN)
mlxcx_buffer_state_t mlb_state = 3 (MLXCX_BUFFER_ON_LOAN)
mlxcx_buffer_state_t mlb_state = 3 (MLXCX_BUFFER_ON_LOAN)
mlxcx_buffer_state_t mlb_state = 3 (MLXCX_BUFFER_ON_LOAN)
.
.
.
mlxcx_buffer_state_t mlb_state = 2 (MLXCX_BUFFER_ON_WQ)
mlxcx_buffer_state_t mlb_state = 2 (MLXCX_BUFFER_ON_WQ)
mlxcx_buffer_state_t mlb_state = 2 (MLXCX_BUFFER_ON_WQ)
mlxcx_buffer_state_t mlb_state = 2 (MLXCX_BUFFER_ON_WQ)
mlxcx_buffer_state_t mlb_state = 2 (MLXCX_BUFFER_ON_WQ)
mlxcx_buffer_state_t mlb_state = 2 (MLXCX_BUFFER_ON_WQ)
mlxcx_buffer_state_t mlb_state = 2 (MLXCX_BUFFER_ON_WQ)
mlxcx_buffer_state_t mlb_state = 2 (MLXCX_BUFFER_ON_WQ)
mlxcx_buffer_state_t mlb_state = 2 (MLXCX_BUFFER_ON_WQ)
mlxcx_buffer_state_t mlb_state = 2 (MLXCX_BUFFER_ON_WQ)

MLXCX_BUFFER_ON_LOAN means the buffer has been loaned out to an mblk, and MLXCX_BUFFER_ON_WQ means it is assigned to a slot in the receive queue, ready to receive data. So blocking on an empty free list is somewhat fallacious: the receive queue is already in a state to receive data. Moreover, the mblks whose buffers are MLXCX_BUFFER_ON_LOAN are all chained together waiting to be passed up the IP stack, but they are not being sent because we appear to be stuck inside mlxcx_buf_take_n(), and hence nothing ever makes it back onto the free list.
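
For reference, here is a hypothetical sketch of the buffer lifecycle these states imply. Only the values 2 (MLXCX_BUFFER_ON_WQ) and 3 (MLXCX_BUFFER_ON_LOAN) are confirmed by the mdb output above; the other members and the transition comments are assumptions for illustration, not a copy of the driver's header.

        /*
         * Hypothetical sketch: values 2 and 3 match the mdb output
         * above, the rest is assumed for illustration.
         */
        typedef enum {
                MLXCX_BUFFER_INIT,        /* allocated, not yet usable */
                MLXCX_BUFFER_FREE,        /* on mlbs_free, ready to post */
                MLXCX_BUFFER_ON_WQ = 2,   /* posted to a receive queue slot */
                MLXCX_BUFFER_ON_LOAN = 3  /* wrapped in an mblk, loaned out */
        } mlxcx_buffer_state_t;

        /*
         * Lifecycle: FREE -> ON_WQ (mlxcx_buf_take_n() posts it to the
         * RQ) -> ON_LOAN (the completion path loans it to an mblk) ->
         * FREE again once mlxcx_buf_return() runs.  The hang occurs when
         * every buffer is ON_WQ or ON_LOAN and the taker blocks instead
         * of letting the loaned buffers come back.
         */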

To compound this, there is a second issue which causes mlxcx_buf_take_n() to be called on every interrupt. mlxcx_buf_take_n() is called by mlxcx_rq_refill(), which is invoked from the receive completion path:

        if (!mlxcx_buf_loan(mlxp, buf)) {
                mlxcx_warn(mlxp, "!loan failed, dropping packet");
                mlxcx_buf_return(mlxp, buf);
                return (NULL);
        }
        ...

        /*
         * Don't check if a refill is needed on every single completion,
         * since checking involves taking the RQ lock.
         */
        if ((buf->mlb_wqe_index & 0x7) == 0) {
                mlxcx_work_queue_t *wq = mlcq->mlcq_wq;
                ASSERT(wq != NULL);
                mutex_enter(&wq->mlwq_mtx);
                if (!(wq->mlwq_state & MLXCX_WQ_TEARDOWN))
                        mlxcx_rq_refill(mlxp, wq);
                mutex_exit(&wq->mlwq_mtx);
        }

The check (buf->mlb_wqe_index & 0x7) == 0 was intended to throttle the refill to one completion in eight, but it is always true, because the preceding call to mlxcx_buf_loan(mlxp, buf) sets buf->mlb_wqe_index to zero. So the refill, and with it the potentially blocking mlxcx_buf_take_n(), runs on every single completion, which exacerbates the issue.
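
One way to fix this particular part is to capture the index before mlxcx_buf_loan() zeroes it and test the saved copy. A minimal sketch of the idea (not necessarily the integrated change):

        /* Save the WQE index before mlxcx_buf_loan() resets it to zero. */
        uint16_t wqe_index = buf->mlb_wqe_index;

        if (!mlxcx_buf_loan(mlxp, buf)) {
                mlxcx_warn(mlxp, "!loan failed, dropping packet");
                mlxcx_buf_return(mlxp, buf);
                return (NULL);
        }
        ...
        /* Test the saved copy, not the now-zeroed field. */
        if ((wqe_index & 0x7) == 0) {
                ...
        }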

We have code in test which removes the cv_reltimedwait() logic entirely and instead dispatches a taskq task to perform the refill, and then only when the receive queue is getting low on buffers.
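
A rough sketch of that approach follows. The free-buffer counter, the low-water threshold and the taskq field are illustrative assumptions; only mlxcx_rq_refill(), the WQ mutex and MLXCX_WQ_TEARDOWN come from the snippets above.

        /* Taskq callback: perform the refill outside interrupt context. */
        static void
        mlxcx_rq_refill_task(void *arg)
        {
                mlxcx_work_queue_t *wq = arg;
                mlxcx_t *mlxp = wq->mlwq_mlx;   /* assumed back-pointer */

                mutex_enter(&wq->mlwq_mtx);
                if (!(wq->mlwq_state & MLXCX_WQ_TEARDOWN))
                        mlxcx_rq_refill(mlxp, wq);
                mutex_exit(&wq->mlwq_mtx);
        }

        /*
         * In the completion path: never wait for buffers at interrupt
         * level.  If the free list is running low, hand the refill to
         * a taskq.  TQ_NOSLEEP makes the dispatch itself safe to call
         * from interrupt context.
         */
        if (s->mlbs_nfree < MLXCX_RQ_REFILL_LOW_WATER) {    /* both assumed */
                (void) taskq_dispatch(mlxp->mlx_refill_tq,  /* assumed */
                    mlxcx_rq_refill_task, wq, TQ_NOSLEEP);
        }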


Related issues

Related to illumos gate - Bug #12438: mlxcx should pass receive messages to mac layer more frequently (Closed, Paul Winder)

Related to illumos gate - Bug #12439: mlxcx send rings can overflow (Closed, Paul Winder)

Related to illumos gate - Bug #12440: mlxcx should not block in the send path (Closed, Paul Winder)

Related to illumos gate - Bug #12441: mlxcx default queue sizes are a bit on the small size (Closed, Paul Winder)

#1

Updated by Paul Winder 8 months ago

  • Related to Bug #12438: mlxcx should pass receive messages to mac layer more frequently added
#2

Updated by Paul Winder 8 months ago

  • Related to Bug #12439: mlxcx send rings can overflow added
#3

Updated by Paul Winder 8 months ago

  • Related to Bug #12440: mlxcx should not block in the send path added
#4

Updated by Paul Winder 8 months ago

  • Related to Bug #12441: mlxcx default queue sizes are a bit on the small size added
#5

Updated by Paul Winder 8 months ago

There are a series of related tickets (listed here).
The combined change for all these tickets produced the following throughput figures, using iperf over a point-to-point 25Gb link with a 9000-byte MTU:

Threads   1      2      4      8      16     32     64     128
Gb/s      20.4   24.7   24.7   24.7   24.7   24.7   23.7   23.4

Similar runs were done with smaller MTUs. A high thread count with a 1500-byte MTU would tend to fill the transmit rings, while the larger MTUs would fill the receive side. It was by varying MTUs and thread counts that I was able to induce the different issues.

#6

Updated by Electric Monk 8 months ago

  • Status changed from In Progress to Closed
  • % Done changed from 80 to 100

git commit 22d052287ba7ed169757650e2eec25fedbae163a

commit  22d052287ba7ed169757650e2eec25fedbae163a
Author: Paul Winder <pwinder@racktopsystems.com>
Date:   2020-04-14T15:40:07.000Z

    12383 Slow down and lock up in mlxcx receive interrupt path
    12438 mlxcx should pass receive messages to mac layer more frequently
    12439 mlxcx send rings can overflow
    12440 mlxcx should not block in the send path
    12441 mlxcx default queue sizes are a bit on the small size
    Reviewed by: Garrett D'Amore <garrett@damore.org>
    Reviewed by: Andy Stormont <astormont@racktopsystems.com>
    Reviewed by: Igor Kozhukhov <igor@dilos.org>
    Reviewed by: Robert Mustacchi <rm@fingolfin.org>
    Approved by: Garrett D'Amore <garrett@damore.org>

#7

Updated by Electric Monk 8 months ago

git commit fd7fa860de2ce9f847175f3d39dfd19f8d5735f9

commit  fd7fa860de2ce9f847175f3d39dfd19f8d5735f9
Author: Dan McDonald <danmcd@joyent.com>
Date:   2020-04-14T21:50:44.000Z

    12383 Slow down and lock up in mlxcx receive interrupt path (fix smatch)
