Bug #8792


mac_tx_srs_quiesce() can kill a maxbw link

Added by Ryan Zezeski over 3 years ago. Updated over 3 years ago.

Start date:
Due date:
% Done:


Estimated time:
Gerrit CR:


I originally reported this as SmartOS ticket OS-6151. I'm going to repeat the original synopsis and all follow up comments here as well for posterity.

Setting maxbw, especially while a link is actively processing data, can kill a link's Tx SRS -- preventing the userspace client from sending data.

When a link is in bandwidth mode (via setting the maxbw property) it makes use of a Tx SRS and worker thread. As the bandwidth limit is reached for a given tick the MAC module will queue data on the Tx SRS before completely dropping packets. As bandwidth becomes available the Tx worker will remove data from this queue and send it across the wire. This is a delicate dance performed between client threads which call mac_tx_srs_enqueue() and MAC Tx SRS worker threads which pull data off the queue via mac_srs_worker(). Since bandwidth is enforced on a per-tick basis the worker thread uses a continuous 1-tick timeout to reset its bandwidth counters. This heartbeat is vital and involves the mac_srs_fire() function, srs_async CV, timeout.9f, untimeout.9f, and MAC quiesce functionality (i.e. SRS_PAUSE).

If data is queued in the SRS then srs_first != NULL. That's why {{mac_tx_srs_enqueue}} has the logic copied below. If data is already queued then it doesn't wake the worker because it knows it was woken the first time this function was called.

    if (mac_srs->srs_first != NULL)
        wakeup_worker = B_FALSE;

The worker spins in a loop checking if a) there is no data to process or b) there is data to process but we've reached the bandwidth limit for this tick. In the (b) case it's going to set a timeout to execute mac_srs_fire() on the next tick and reset the bandwidth counters. Then it's going to wait for that timeout (other threads/functions can signal this but at this point the timeout is our only hope).

            if (bw_ctl_flag && mac_srs->srs_tid == NULL) {
                 * We need to ensure that a timer is already
                 * scheduled or we force schedule one for
                 * later so that we can continue processing
                 * after this quanta is over.
                mac_srs->srs_tid = timeout(mac_srs_fire,
                    mac_srs, 1);
            cv_wait(async, lock);
            CALLB_CPR_SAFE_END(&cprinfo, lock);

Once the worker is signaled it will recheck the bandwidth numbers and then call the drain function with the data it is allowed to process that tick.

This all works fine until you change the maxbw property while the worker is waiting. The key to this puzzle are the lines following the cv_wait():

            if (mac_srs->srs_state & SRS_PAUSE)
                goto done;
            if (mac_srs->srs_state & SRS_PROC)
                goto wait;

When you change maxbw on an active link it will call mac_tx_srs_update_bwlimit(). This will call mac_tx_client_quiesce() -- which will set the SRS_PAUSE flag on the SRS. This will then trip the if statement above and jump the worker past the srs_drain_func call and instead lead it directly to the mac_srs_worker_quiesce() call. This leads to mac_srs_soft_rings_quiesce() which leads to calling untimeout(srs_tid).

        (void) untimeout(mac_srs->srs_tid);
         * RPZ: since we are calling untimeout shouldn't we set
         * srs_tid=NULL?

However, as my comment points out, this function does't reset the srs_tid even though we just canceled it. This is a problem, because the only other function that resets the srs_tid is the drain function, but we jumped past that in mac_srs_worker() because of the special SRS_PAUSE logic.

So after the SRS has quiesced and we restart the workers we now have a situation where the worker thinks it has a timeout registered for mac_srs_fire() when it has no such thing. At this point the clients can queue all the data they want, but the worker thread, like Snow White, has eaten a poisoned apple and there is no prince of MAC to save her.

Actions #1

Updated by Ryan Zezeski over 3 years ago

This bug is very easy to reproduce.

  • Create a link.
  • Plumb an IP.
  • Set maxbw on the link.
  • Run iperf (or some other load gen) bound to that IP.
  • Change the maxbw setting on the link while the load gen is running.

To remove any doubt that my prognosis is correct I wrote a dtrace script to track the flow of Tx bytes over a link and detect when cv_wait() is called without a corresponding mac_srs_fire() timeout.

Here's the script:

#pragma D option quiet

#define SRS_PROC 0x00000010

 * Pull the MAC client name from the SRS. We hook into the worker
 * thread via mac_srs_worker_restart() because mac_srs_worker() is
 * just a for loop. This way we hook into the thread anytime maxbw is
 * set, as opposed to starting the script before the link is created.
fbt::mac_srs_worker_restart:entry {
    this->name = stringof(args[0]->srs_mcip->mci_name);

 * If this worker thread is running on behalf of the MAC client we are
 * interested in then stash the SRS pointer in a thread local variable
 * (which doubles as a flag indicating we are interested in this
 * thread).
fbt::mac_srs_worker_restart:entry /this->name == $1/ {
    self->srs = args[0];
    printf("Hooking into %s mac_srs_worker (0x%p)\n", this->name,

fbt::untimeout:entry /self->srs && self->srs->srs_tid == args[0]/ {
    printf("Canceling SRS timeout: 0x%p (worker 0x%p)\n",
        args[0], curthread);
    self->untimeout = args[0];

 * The cv_wait(async, lock) call in mac_srs_worker() relies on the
 * mac_srs_fire() timeout to occur when in bandwidth mode. If this
 * timeout is canceled but the srs_tid isn't cleared then
 * mac_srs_worker() will think there is an outstanding timeout waiting
 * to fire when there isn't -- thus waiting on the async signal
 * forever.
 * In this case you'll see the MAC client repeatedly call
 * max_tx_srs_enqueue() -- enqueuing data well past the BW queue limit
 * and waiting for the worker thread to drain it.
fbt::cv_wait:entry /self->srs && !(self->srs->srs_state & SRS_PROC) &&
    self->srs->srs_tid != NULL && self->untimeout == self->srs->srs_tid/ {
    printf("OOPS! %s waiting on timeout that will never come!\n",
    wait_start = timestamp;
    wait_thread = curthread;

 * Remember, bytes are queued by the client on a different thread than
 * worker. We need to check the name here again.
fbt::mac_tx_srs_enqueue:entry /stringof(args[0]->srs_mcip->mci_name) == $1/ {
    self->srs = args[0];
    self->q_before = self->srs->srs_size;

fbt::mac_tx_srs_enqueue:entry /self->srs && self->srs->srs_size > 0 &&
    wait_start != 0 && ((timestamp - wait_start) > 5000000000)/ {
    printf("ERROR: mac_srs_worker 0x%p has been waiting for 5 seconds.\n",
    printf("There is data to deliver -- it should have been signaled!\n");
    printf("Use mdb -k to inspect the mac_soft_ring_set_t: 0x%p\n",

fbt::mac_tx_srs_enqueue:return /self->srs/ {
    this->queued = self->srs->srs_size - self->q_before;
    printf("%u bytes queued on %s (%u total)\n", this->queued,
        self->srs->srs_mcip->mci_name, self->srs->srs_size);
    self->q_before = 0;
    self->srs = 0;

 * The drain function is called from the SRS worker thread so we just
 * check self->srs which was set via the probe on
 * mac_srs_worker_restart().
fbt::mac_tx_srs_drain:entry /self->srs/ {
    self->q_before = self->srs->srs_size;
    enqueue_count = 0;

fbt::mac_tx_srs_drain:return /self->srs/ {
    this->drained = self->q_before - self->srs->srs_size;
    printf("%u bytes drained on %s (worker 0x%p) (%u remain)\n",
        this->drained, self->srs->srs_mcip->mci_name, curthread,
    self->q_before = 0;

To reproduce I run the following steps:

# dladm create-vnic -t -l e1000g0 tvn1
# ipadm create-addr -t -T static -a tvn1/v4
# dtrace -Cs maxbw-stuck.d '"tvn1"'
# dladm set-linkprop -p maxbw=20 tvn1

In other terminal on the same host:
# iperf -c -B -p 5007 -i 1 -t 30

Then back to the first terminal:
# dladm set-linkprop -p maxbw=19 tvn1

When you change maxbw you'll notice the iperf output stops (because it can't send data) and you'll see a change in the dtrace script output from interleavings of queued/drained message to purely queued messages:

Script output when things are running fine:

1514 bytes queued on tvn1 (1080120 total)
1514 bytes queued on tvn1 (1081634 total)
1514 bytes queued on tvn1 (1083148 total)
1514 bytes queued on tvn1 (1084662 total)
1514 bytes queued on tvn1 (1086176 total)
3028 bytes drained on tvn1 (worker 0xffffff0040dedc40) (1083148 remain)
3028 bytes drained on tvn1 (worker 0xffffff0040dedc40) (1080120 remain)
3846 bytes drained on tvn1 (worker 0xffffff0040dedc40) (1076274 remain)
3028 bytes drained on tvn1 (worker 0xffffff0040dedc40) (1073246 remain)

Script output after changing maxbw under load and things go sideways:

Canceling SRS timeout: 0x17ff4654d (worker 0xffffff0040dedc40)
Hooking into tvn1 mac_srs_worker (0xffffff0040dedc40)
OOPS! tvn1 waiting on timeout that will never come!

Hooking into tvn1 mac_srs_worker (0xffffff00421aac40)
1514 bytes queued on tvn1 (1081634 total)
1514 bytes queued on tvn1 (1083148 total)
1514 bytes queued on tvn1 (1084662 total)
1514 bytes queued on tvn1 (1086176 total)
1514 bytes queued on tvn1 (1087690 total)
1514 bytes queued on tvn1 (1089204 total)
1514 bytes queued on tvn1 (1090718 total)
1514 bytes queued on tvn1 (1092232 total)
1514 bytes queued on tvn1 (1093746 total)
1514 bytes queued on tvn1 (1095260 total)
1514 bytes queued on tvn1 (1096774 total)
1514 bytes queued on tvn1 (1098288 total)
150 bytes queued on tvn1 (1098438 total)
54 bytes queued on tvn1 (1098492 total)
1514 bytes queued on tvn1 (1100006 total)
150 bytes queued on tvn1 (1100156 total)
ERROR: mac_srs_worker 0xffffff0040dedc40 has been waiting for 5 seconds.
There is data to deliver -- it should have been signaled!
Use mdb -k to inspect the mac_soft_ring_set_t: 0xffffff0e4ee5d040
Actions #2

Updated by Ryan Zezeski over 3 years ago

It turns out that setting maxbw is just a specific instance of a more general problem. The real issue is that any time you quiesce the Tx SRS worker you run the chance of killing the link. Today I tripped this same issue by just running a maxbw test program. It turns out that the worker is also quiesced on stream detach (dld_str_detach()). This induces the same problem as described above: we jump out of the mac_srs_worker() for(;;) loop and untimeout() the srs_tid without setting it to NULL. Then we end up in a situation where the worker waits on async forever and the max_tx_bw_mode() just keeps piling data up in the SRS queue.

For example, I ran this dtrace script:

# dtrace -n 'fbt::mac_tx_srs_quiesce:entry { printf("srs_tid: 0x%p\n", args[0]->srs_tid); stack(); } fbt::mac_srs_soft_rings_quiesce:entry { self->t=1; } fbt::untimeout:entry /self->t/ { printf("untimeout: 0x%p\n", arg0); stack(); self->t=0; }'

And then I ran my new maxbw test which uses DLPI (thus STREAMS) to send custom ethertype frrames across two vnics on an etherstub. When the test finishes the stream is closed and detaches. Causing the following output from the dtrace script:

  9  28760         mac_tx_srs_quiesce:entry srs_tid: 0x17fff706b


  9  24212                  untimeout:entry untimeout: 0x17fff706b


Then if I try to run the maxbw test a second time it hangs because it can't get any data across the link. If I inspect the Tx SRS in mdb I find the following:

> ::mac_srs ! grep vnic1
ffffff0e307029c0 maxbw_vnic1          ffffff0e306d0bb8                0 TX
ffffff0e30703680 maxbw_vnic1          ffffff0e306d0bb8                0 RX
> ffffff0e307029c0::print mac_soft_ring_set_t srs_tid
srs_tid = 0x17fff706b

So the timer was canceled (proven by dtrace output) but yet the SRS still thinks that srs_tid is valid and thus will never start a new timer -- effectively killing this link.

Actions #3

Updated by Electric Monk over 3 years ago

  • Status changed from New to Closed
  • % Done changed from 0 to 100

git commit da906b4e8cca4344750ec24c7007b5a3f26e8919

commit  da906b4e8cca4344750ec24c7007b5a3f26e8919
Author: Ryan Zezeski <>
Date:   2018-01-22T19:04:20.000Z

    8792 mac_tx_srs_quiesce() can kill a maxbw link
    Reviewed by: Robert Mustacchi <>
    Reviewed by: Ken Mays <>
    Reviewed by: Toomas Soome <>
    Reviewed by: randyf <>
    Approved by: Dan McDonald <>


Also available in: Atom PDF